From PCA to Hebbian learning
📝 Idea / Thought
What is PCA?
From *Pattern Recognition and Machine Learning*:
Orthogonal projection of the data onto a lower dimensional linear space; linear projection that minimizes the average projection cost.
The algebraic connection between PCA and Hebbian learning
A famous statement of the principle of Hebbian learning is:
Neurons that fire together, wire together.
It means that when two neurons are co-activated, the synaptic weight between them is strengthened. This simple rule can be written as: \(\Delta w = yx\) The activation of the postsynaptic neuron $y$ is
\(y = w^\top x\) and the covariance of the (zero-mean) presynaptic activity $x$ is \(\mathbf{C = \mathbb{E}(xx^\top)}\)
Then, the update of the synaptic weight $w$ can be written as: \(\Delta w = \eta yx = \eta(w^\top x)x\)
Taking the expectation over the inputs, the average update of $w$ becomes: \(\mathbb{E}(\Delta w) = \eta \mathbf{\mathbb{E}(xx^\top)w} = \eta \mathbf{Cw}\)
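As a quick sanity check, here is a minimal NumPy sketch (the covariance matrix, learning rate, and sample size are arbitrary choices for illustration) showing that the Hebbian update, averaged over many samples, points in the direction of $Cw$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 2-D zero-mean Gaussian data with an anisotropic covariance.
C_true = np.array([[3.0, 1.0],
                   [1.0, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], C_true, size=10_000)

eta = 0.01                      # arbitrary learning rate
w = rng.normal(size=2)          # random initial weight vector

# Plain Hebbian update Δw = η * y * x, averaged over the dataset.
y = X @ w                       # postsynaptic activations y = wᵀx
delta_w_avg = eta * (y[:, None] * X).mean(axis=0)

# The average update should match η * C * w, up to sampling noise.
C_emp = (X.T @ X) / len(X)      # empirical covariance (zero-mean data)
print(delta_w_avg)
print(eta * C_emp @ w)
```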
Then there is a question: what’s the link between $w$ and the maximum variance direction of the data?
We can write down the expression of the variance first (for zero-mean data, $\mathbb{E}[x]=0$): \(\begin{align} Var(Y) &= \mathbb{E}(Y^2) - (\mathbb{E}Y)^2 \\ &= \mathbb{E}\mathbf{(w^\top x)^2} \\ &= \mathbb{E}\mathbf{(w^\top x x^\top w)} \\ &= \mathbf{w^\top C w} \end{align}\)
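This identity is easy to check numerically; a small sketch, assuming the same kind of zero-mean synthetic data as above (the unit vector and covariance are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 1.0]], size=100_000)
w = np.array([0.6, 0.8])        # any unit vector

C = (X.T @ X) / len(X)          # empirical covariance (zero-mean data)
y = X @ w                       # projected data Y = wᵀx

print(y.var())                  # Var(Y) estimated from samples
print(w @ C @ w)                # wᵀCw — should agree up to sampling noise
```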
The covariance matrix $C$ can be decomposed as:
\[C = V \Lambda V^\top\]This decomposition exists because $C$ is a symmetric matrix. The matrix $V$ collects the eigenvectors $v_i$ as its columns, and $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_i$, each describing how much variance lies along the corresponding eigenvector.
In the two-dimensional case, $V$ can be written as: \(V = \begin{bmatrix} \vert & \vert \\ v_1 & v_2 \\ \vert & \vert \end{bmatrix}\)
and $\Lambda$ is: \(\Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}\)
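In code, this is just the eigendecomposition of a symmetric matrix; a minimal sketch with an assumed 2×2 covariance:

```python
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 1.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, V = np.linalg.eigh(C)
Lambda = np.diag(eigvals)

# Reconstruct C = V Λ Vᵀ to confirm the decomposition.
print(np.allclose(V @ Lambda @ V.T, C))   # True
print(eigvals)                            # variance along each eigenvector
```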
To overcome the unbounded growth of the Hebbian weight, we add a normalization term that constrains $\|w\|$; this gives Oja’s rule:
\[\Delta w = yx - y^2w\]At the equilibrium point, the weights $w$ stop changing on average and satisfy
\[\begin{align} \mathbb{E}(\Delta w) &= 0 \\ \mathbf{Cw} &= \mathbf{(w^\top C w)w} \end{align}\]which is exactly the eigenvalue equation $Cw = \lambda w$ with $\lambda = w^\top C w$. Geometrically, $Cw = \lambda w$ means that applying the covariance matrix to the unit vector $w$ does not rotate it; it only scales it up or down along the direction of $w$.
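To see this equilibrium in practice, here is a sketch of Oja’s rule run online on synthetic data (the covariance, learning rate, and number of steps are all arbitrary assumptions); the weight vector should converge to the leading eigenvector of $C$, up to sign:

```python
import numpy as np

rng = np.random.default_rng(2)
C_true = np.array([[3.0, 1.0],
                   [1.0, 1.0]])

eta = 0.01                      # small learning rate for stability
w = rng.normal(size=2)          # random initial weights

# Online Oja updates: Δw = η (y x − y² w)
for _ in range(20_000):
    x = rng.multivariate_normal([0.0, 0.0], C_true)
    y = w @ x
    w += eta * (y * x - (y ** 2) * w)

eigvals, V = np.linalg.eigh(C_true)
top = V[:, -1]                  # eigenvector with the largest eigenvalue

print(w / np.linalg.norm(w))    # learned direction (unit norm at equilibrium)
print(top)                      # should match up to sign
```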
To better understand why $\lambda = w^\top C w$, it helps to decompose it into two steps (checked numerically in the sketch after this list):
- step 1: $Cw$ is a transformation by the covariance matrix: the data distribution reshapes the vector $w$ into a new vector $v = Cw$.
- step 2: $w^\top v$ projects the transformed vector from step 1 back onto the original direction, measuring how much of the transformation stays along $w$. If $w$ is the leading eigenvector, this value is the maximum possible variance. The condition $w^\top w = 1$ comes from normalization.
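The two steps can be checked directly in NumPy, continuing with the same assumed 2×2 covariance as above:

```python
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 1.0]])
eigvals, V = np.linalg.eigh(C)
w = V[:, -1]                    # unit-norm leading eigenvector

v = C @ w                       # step 1: transform w by the covariance
lam = w @ v                     # step 2: project back onto w → wᵀCw

print(lam, eigvals[-1])         # both equal the largest eigenvalue
print(w @ w)                    # 1.0 — normalization wᵀw = 1
```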
