Oja’s learning rule
📝 Idea / Thought
Oja’s rule was invented to overcome the problem of unbounded weight growth when training a neural network with the Hebbian learning rule
\[y_i = \sum_j w_{ij} x_j\]

Oja’s rule
Oja’s rule uses a multiplicative constraint to re-normalize the weights at every step, so that the weight vector converges to the first principal component of the input. Keep in mind that this holds only for the single-neuron Oja model; if the model has multiple neurons following Oja’s rule, the weights converge to the subspace spanned by the leading eigenvectors (e.g., the top two for two neurons), not to the individual eigenvectors themselves.
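To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the original papers): a single neuron trained with Oja’s rule on synthetic 2-D data, whose weight vector ends up aligned, up to sign, with the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean, correlated 2-D data
C = np.array([[3.0, 1.0],
              [1.0, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=5000)

w = rng.normal(size=2)      # random initial weights
eta = 1e-3                  # small learning rate (assumed)

for _ in range(5):          # a few passes over the data
    for x in X:
        y = w @ x                        # y = w . x
        w += eta * (y * x - y**2 * w)    # Oja's rule: dw = eta (y x - y^2 w)

# Compare with the leading eigenvector of the sample covariance (sign may differ)
_, eigvecs = np.linalg.eigh(np.cov(X.T))
print("learned w:", w / np.linalg.norm(w))
print("first PC :", eigvecs[:, -1])
```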
How Oja’s rule is derived
We already know that in classic Hebbian learning the output is \(\mathbf{y = w \cdot x}\), and the weight update rule is \(\mathbf{\Delta w = \eta y x}\).
- This means that when the presynaptic activity $\mathbf{x}$ and the postsynaptic activity $\mathbf{y}$ fire together, the weights are strengthened further.

We can also write the update as \(\mathbf{w(t+1) = w(t) + \eta yx}\). To avoid unbounded weight growth, at each step the updated weight vector is re-normalized by its own norm:
\(\mathbf{w(t+1) = \frac{w'}{||w'||}}\), where \(\mathbf{w' = w(t) + \eta yx}\).

Using \(y = \mathbf{w(t) \cdot x}\) and assuming \(||\mathbf{w}(t)|| = 1\), we can expand $\mathbf{||w'||^2}$ as

\(\begin{align} \mathbf{||w'||^2} &= \mathbf{||w(t)||^2 + 2\eta y \, x \cdot w(t) + \eta^2 y^2 ||x||^2} \\ &= 1 + 2\eta y^2 + \eta^2 y^2 ||x||^2 \end{align}\)

Then, let $\epsilon = 2\eta y^2 + \eta^2 y^2 ||x||^2$:

\(\begin{align} ||w'||^2 &= 1+\epsilon \\ ||w'|| &= \sqrt{1+\epsilon} = (1+\epsilon)^{\frac{1}{2}} \end{align}\)

Next, approximate $||w'||$ with a Taylor expansion:

\(\begin{align} ||w'|| = \sqrt{1+\epsilon} \approx 1 + \frac{1}{2}\epsilon \approx 1 + \eta y^2 + O(\eta^2) \end{align}\)

Let $\delta = \eta y^2 + O(\eta^2)$ and apply a Taylor expansion again:

\(\frac{1}{||w'||} = (1+\delta)^{-1} \approx 1-\delta \approx 1- \eta y^2 + O(\eta^2)\)

With all the pieces in place, we can now compute $\mathbf{w(t+1)}$:

\(\begin{align} \mathbf{w(t+1)} &= \mathbf{\frac{w'}{||w'||}} \approx \mathbf{(w(t)+\eta yx)(1-\eta y^2 + O(\eta^2))} \\ &\approx \mathbf{w(t)(1-\eta y^2) + \eta yx(1-\eta y^2)} \\ &\approx \mathbf{w(t) + \eta yx - \eta y^2 w(t) - \eta^2 y^3 x} \end{align}\)

Finally, we move $\mathbf{w(t)}$ to the left-hand side:

\(\begin{align} \mathbf{w(t+1) - w(t) = \eta(yx - y^2 w) - \eta^2 y^3 x} \end{align}\)

The last term is $O(\eta^2)$ and can be dropped, which leaves Oja’s rule:

\(\mathbf{\Delta w = \eta(yx - y^2 w)}\)
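As a quick sanity check on this derivation (my own sketch, not part of the original argument), one can compare a single explicitly re-normalized Hebbian step with the first-order Oja update; for a small learning rate the two should differ only by $O(\eta^2)$ terms:

```python
import numpy as np

rng = np.random.default_rng(1)

w = rng.normal(size=3)
w /= np.linalg.norm(w)       # start on the unit sphere, ||w(t)|| = 1
x = rng.normal(size=3)
eta = 1e-3
y = w @ x

# Exact update: Hebbian step followed by explicit re-normalization
w_exact = w + eta * y * x
w_exact /= np.linalg.norm(w_exact)

# Oja's rule: first-order expansion of the normalized update
w_oja = w + eta * (y * x - y**2 * w)

print("difference:", np.linalg.norm(w_exact - w_oja))  # roughly O(eta^2)
```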
Here is the multi-neuron version of Oja’s learning rule: \(\Delta w_{ij} = \alpha(x_j y_i - y_i \sum_{k=1}^m w_{kj} y_k)\)
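In matrix form this reads $\Delta W = \alpha\,(\mathbf{y}\mathbf{x}^T - \mathbf{y}\mathbf{y}^T W)$ with $\mathbf{y} = W\mathbf{x}$. Below is a rough NumPy sketch of this subspace rule (the data, shapes, and learning rate are my own assumptions); after training, the rows of $W$ should lie in, and span, the leading principal subspace rather than equal the individual principal components:

```python
import numpy as np

rng = np.random.default_rng(2)

n, m = 5, 2                                # input dimension, number of neurons
stds = np.sqrt(np.array([5.0, 3.0, 1.0, 0.5, 0.2]))
X = rng.normal(size=(20000, n)) * stds     # zero-mean data, variances 5, 3, 1, 0.5, 0.2

W = 0.1 * rng.normal(size=(m, n))
alpha = 1e-3

for x in X:
    y = W @ x                                            # y_i = sum_j w_ij x_j
    W += alpha * (np.outer(y, x) - np.outer(y, y) @ W)   # dW = alpha (y x^T - y y^T W)

# The top-2 principal directions here are the first two coordinate axes,
# so the rows of W should (approximately) have zero weight on the other axes.
print(np.round(W, 2))
```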
Regularization term in Oja’s rule
In the multi-neuron version, the regularization term $\mathbf{y^2 w}$ becomes $\mathbf{y y^T W}$, and we can keep either only its diagonal part, $\mathrm{diag}(\mathbf{y})^2 W$, or the full matrix including the off-diagonal (cross-neuron) terms (a sketch contrasting the two variants follows the list below):
- diagonal term only: the neurons decouple and all of them learn the first PC
- with the off-diagonal terms: the neurons learn the principal subspace
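Using the same setup as the sketch above, but keeping only the diagonal part of the regularizer (so each row $i$ is updated with $\alpha(y_i \mathbf{x} - y_i^2 \mathbf{w}_i)$), the neurons decouple and every row should align, up to sign, with the first principal component (again my own illustration, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(3)

n, m = 5, 2
stds = np.sqrt(np.array([5.0, 3.0, 1.0, 0.5, 0.2]))
X = rng.normal(size=(20000, n)) * stds     # first axis = first principal component

W = 0.1 * rng.normal(size=(m, n))
alpha = 1e-3

for x in X:
    y = W @ x
    W += alpha * (np.outer(y, x) - np.diag(y**2) @ W)   # keep only the diagonal term

# Each row now runs single-neuron Oja's rule on its own, so both rows should
# point (up to sign) along the first coordinate axis.
print(np.round(W, 2))
```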
