From PCA to Hebbian learning

📝 Idea / Thought

What is PCA?

From Pattern Recognition and Machine Learning:

Orthogonal projection of the data onto a lower-dimensional linear space; the linear projection that minimizes the average projection cost.
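
As a minimal sketch of this definition, the NumPy snippet below (the toy data, dimensionality, and variable names are my own choices, not from the book) projects centered data onto its top two eigenvectors and prints the resulting average projection cost:

```python
import numpy as np

# Toy 3-D data with most variance along the first two axes (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)                 # center the data

C = X.T @ X / len(X)                   # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
V = eigvecs[:, ::-1][:, :2]            # top-2 principal directions

X_proj = X @ V @ V.T                   # orthogonal projection onto the 2-D principal subspace
avg_cost = np.mean(np.sum((X - X_proj) ** 2, axis=1))
print("average projection cost:", avg_cost)   # roughly the discarded (smallest) eigenvalue
```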

The algebraic connection between PCA and Hebbian learning

A famous quote for the principle of Hebbian learning is:

Neurons that fire together, wire together.

It means that when neurons are co-activated, the synaptic weight between them is strengthened. This simple rule can be written as: \(\Delta w = yx\) And we know that the activation of the postsynaptic neuron $Y$ can be written as

\(y = w^\top x\) We also know that the covariance of the presynaptic input $X$ (assuming zero mean) is \(\mathbf{C = \mathbb{E}(xx^\top)}\)

Then, the update of the synaptic weight $w$ can be written as: \(\Delta w = \eta yx = \eta(w^\top x)x\)

Taking the expectation over the data, and using $(w^\top x)x = xx^\top w$, the average update of $w$ can be written as: \(\mathbb{E}(\Delta w) = \eta \mathbf{\mathbb{E}(xx^\top)w} = \eta \mathbf{Cw}\)
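
A quick numerical check of this averaging step, on a made-up two-dimensional Gaussian (the covariance values, learning rate, and sample size below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=10_000)
C = X.T @ X / len(X)          # sample covariance of the presynaptic activity

eta = 0.01
w = rng.normal(size=2)        # arbitrary initial weights

updates = eta * (X @ w)[:, None] * X   # per-sample Hebbian update: eta * y * x
print(updates.mean(axis=0))            # empirical E[Delta w]
print(eta * C @ w)                     # eta * C w, should match closely
```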

Then there is a question: what is the link between $w$ and the direction of maximum variance of the data?

We can write down the expression of the variance first (for zero mean, $\mathbb{E}[x]=0$): \(\begin{align} Var(Y) &= \mathbb{E}(Y^2) - (\mathbb{E}Y)^2 \\ &= \mathbb{E}\mathbf{(w^\top x)^2} \\ &= \mathbb{E}\mathbf{(w^\top x x^\top w)} \\ &= \mathbf{w^\top C w} \end{align}\)
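
The identity $Var(Y) = \mathbf{w^\top C w}$ can be checked numerically; in the sketch below the toy data and the unit vector $w$ are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.0], [1.0, 2.0]], size=50_000)
C = X.T @ X / len(X)

w = np.array([0.6, 0.8])   # any unit vector
y = X @ w                  # projection of each sample onto w

print(np.var(y))           # empirical Var(Y)
print(w @ C @ w)           # w^T C w, should agree up to sampling noise
```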

The covariance matrix $C$ can be decomposed as:

\[C = V \Lambda V^\top\]

This decomposition works because $C$ is a symmetric matrix. The matrix $V$ collects the eigenvectors $v_i$ as its columns, and $\Lambda$ is a diagonal matrix of eigenvalues, each one measuring how much variance the data has along the corresponding eigenvector.

In a two-dimensional example, $V$ can be written as: \(V = \begin{bmatrix} \vert & \vert \\ v_1 & v_2 \\ \vert & \vert \end{bmatrix}\)

and $\Lambda$ is: \(\Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}\)
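
For concreteness, here is the decomposition computed with NumPy on a small symmetric matrix (the entries are made up):

```python
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 2.0]])

eigvals, V = np.linalg.eigh(C)   # eigh applies because C is symmetric
Lambda = np.diag(eigvals)

print(V @ Lambda @ V.T)          # reconstructs C, i.e. C = V Lambda V^T
print(V.T @ V)                   # close to the identity: the eigenvectors are orthonormal
```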

To overcome the unbounded growth of the Hebbian weights, we add a normalization term that constrains $\|w\|$, which gives Oja’s rule:

\[\Delta w = yx - y^2w\]
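
A minimal simulation of Oja’s rule, assuming a toy two-dimensional Gaussian and a small fixed learning rate (both my own choices): the weight vector converges, up to sign, to the leading eigenvector of the covariance matrix, and its norm settles near 1.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.0], [1.0, 2.0]], size=20_000)

eta = 0.01
w = rng.normal(size=2)
for x in X:
    y = w @ x
    w += eta * (y * x - y**2 * w)   # Oja's rule: Hebbian term minus normalization term

C = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(C)
print(w, np.linalg.norm(w))         # approximately a unit vector ...
print(eigvecs[:, -1])               # ... aligned (up to sign) with the leading eigenvector
```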

At the equilibrium point ($\mathbb{E}[\Delta w] = 0$), the weights $w$ stop changing on average. Using $\mathbb{E}[yx] = \mathbf{Cw}$ and $\mathbb{E}[y^2] = \mathbf{w^\top C w}$, the equilibrium condition becomes

\[\begin{align} \mathbb{E}(\Delta w) &= 0 \\ \mathbf{Cw} &= \mathbf{(w^\top C w)w} \end{align}\]

This is exactly the eigenvalue equation $Cw = \lambda w$ with $\lambda = \mathbf{w^\top C w}$.

Geometrically, the equation $Cw = \lambda w$ means that applying the covariance matrix to the unit vector $w$ does not rotate it: the result is simply $w$ scaled up or down along its own direction.

To better understand why $\lambda = w^\top C w$, it helps to break the expression into two steps (a small numerical sketch follows the list):

  • step 1: $Cw$ is a transformation by the covariance matrix. The data distribution reshapes the vector $w$ into a new vector $v = Cw$.
  • step 2: $w^\top v$ projects the transformed vector back onto the original direction, measuring how much variance the data has along $w$. If $w$ is the leading eigenvector, it captures the maximum possible variance; the normalization $w^\top w = 1$ keeps this a pure variance measure, not one inflated by the length of $w$.
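
The two steps can be traced numerically; this sketch reuses the same made-up matrix as above:

```python
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(C)
w = eigvecs[:, -1]       # leading eigenvector, already normalized (w^T w = 1)

v = C @ w                # step 1: the covariance matrix reshapes w into v
lam = w @ v              # step 2: project v back onto the original direction w
print(lam, eigvals[-1])  # both equal the largest eigenvalue
```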