## The Anatomy of our Hopfield Network

An *Associative Memory* is a dynamical system that is concerned with
the memorization and retrieval of data.

The structure of our data in the demo above is a collection of `(image, label)`
pairs, where each *image variable*
${x \in \mathbb{R}^{3N_\mathrm{pixels}}}$ is represented
as a rasterized vector of the RGB pixels and each
*label variable*
${y \in \mathbb{R}^{N_\mathrm{people}}}$ identifies
a person and is represented as a one-hot vector. Our Associative Memory additionally
introduces a third, hidden variable for the
*memories*
${z \in \mathbb{R}^{N_\mathrm{memories}}}$.
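To make these shapes concrete, here is a tiny numpy sketch of how an `(image, label)` pair could be rasterized and one-hot encoded. The sizes (`H`, `W`, `N_people`) are invented for illustration and are not the demo's actual values:

```python
import numpy as np

# Hypothetical tiny sizes for illustration (not the demo's actual values)
H = W = 4        # a 4x4 image, so N_pixels = 16
N_people = 3

image = np.random.default_rng(0).random((H, W, 3))  # RGB values in [0, 1]
x = image.reshape(-1)   # image variable: rasterized vector in R^{3 N_pixels}
y = np.eye(N_people)[1]  # label variable: one-hot vector for person index 1

print(x.shape)   # (48,)
print(y)         # [0. 1. 0.]
```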

In Associative Memories, each of these variables has both an
*internal state*
that evolves in time and an *axonal state*
(an isomorphic function of the internal state)
that influences how the rest of the network evolves. This terminology of
internal/axonal is inspired by biology, where the "internal" state is analogous
to the internal current of a neuron
(other neurons don't see the inside of other neurons)
and the "axonal" state is analogous to a neuron's *firing rate*
(a neuron's axonal output is how it signals to other neurons). We denote the internal state of a variable with a hat (i.e.,
variable ${x}$ has internal state ${\hat{x}}$, ${y}$ has internal state ${\hat{y}}$, and ${z}$ has internal state ${\hat{z}}$).

Dynamic variables in Associative Memories have two states: an internal
state and an axonal state.

We call the axonal state the *activations* and they are uniquely
defined by our choice of a scalar and convex Lagrangian function on that
variable
(see Krotov (2021),
Krotov & Hopfield (2021), and Hoover et al. (2022) for
more details). Specifically, in this demo we choose

$\begin{align*}
L_x(\hat{x}) &\triangleq \frac{1}{2} \sum\limits_i \hat{x}_i^2\\
L_y(\hat{y}) &\triangleq \log \sum\limits_k \exp (\hat{y}_k)\\
L_z(\hat{z}) &\triangleq \frac{1}{\beta} \log \sum\limits_\mu \exp(\beta \hat{z}_\mu)
\end{align*}$

These Lagrangians dictate the axonal states (activations)
of each variable.

$\begin{align*}
x &= \nabla_{\hat{x}} L_x = \hat{x}\\
y &= \nabla_{\hat{y}} L_y = \mathrm{softmax}(\hat{y})\\
z &= \nabla_{\hat{z}} L_z = \mathrm{softmax}(\beta \hat{z})
\end{align*}$
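The "activation is the gradient of the Lagrangian" relationship is easy to check numerically. Below is a small sketch (the vectors and ${\beta}$ are arbitrary) that compares each closed-form activation against a finite-difference gradient of its Lagrangian:

```python
import numpy as np

beta = 2.0  # arbitrary inverse temperature for this sketch

def L_x(x_hat):
    # L_x = 1/2 * sum_i x_hat_i^2  ->  gradient is the identity
    return 0.5 * np.sum(x_hat ** 2)

def L_z(z_hat):
    # L_z = (1/beta) * log sum_mu exp(beta * z_hat_mu)  ->  gradient is softmax(beta * z_hat)
    return np.log(np.sum(np.exp(beta * z_hat))) / beta

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

x_hat = np.array([0.5, -1.0, 2.0])
z_hat = np.array([0.1, 0.9, -0.3])

x = x_hat                  # activation of x
z = softmax(beta * z_hat)  # activation of z

# Finite-difference check: the activations equal the numerical gradients
eps = 1e-6
num_grad_x = np.array([(L_x(x_hat + eps * e) - L_x(x_hat - eps * e)) / (2 * eps)
                       for e in np.eye(len(x_hat))])
num_grad_z = np.array([(L_z(z_hat + eps * e) - L_z(z_hat - eps * e)) / (2 * eps)
                       for e in np.eye(len(z_hat))])
assert np.allclose(num_grad_x, x, atol=1e-6)
assert np.allclose(num_grad_z, z, atol=1e-6)
```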

The Legendre transform of the Lagrangian defines the energy of each variable.

$\begin{align*}
E_x &= \sum\limits_i \hat{x}_i x_i - L_x\\
E_y &= \sum\limits_k \hat{y}_k y_k - L_y\\
E_z &= \sum\limits_\mu \hat{z}_\mu z_\mu - L_z
\end{align*}$
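These Legendre transforms have simple closed forms worth noting: ${E_x}$ reduces to ${\frac{1}{2}\sum_i x_i^2}$, and ${E_y}$ reduces to the negative Shannon entropy of the softmax distribution ${y}$. A quick numerical sketch (arbitrary vectors and ${\beta}$):

```python
import numpy as np

beta = 2.0

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

x_hat = np.array([0.5, -1.0, 2.0])
y_hat = np.array([1.0, 0.0, -1.0])
z_hat = np.array([0.1, 0.9, -0.3])

# Activations (gradients of the Lagrangians)
x, y, z = x_hat, softmax(y_hat), softmax(beta * z_hat)

# Legendre transform: E = <internal state, activation> - Lagrangian
E_x = x_hat @ x - 0.5 * np.sum(x_hat ** 2)        # = 1/2 ||x||^2
E_y = y_hat @ y - np.log(np.sum(np.exp(y_hat)))   # = -entropy(y) <= 0
E_z = z_hat @ z - np.log(np.sum(np.exp(beta * z_hat))) / beta

assert np.isclose(E_x, 0.5 * np.sum(x ** 2))
assert np.isclose(E_y, np.sum(y * np.log(y)))     # negative Shannon entropy
assert E_y <= 0
```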

All variables in Associative Memories have a special Lagrangian function
that defines the axonal state and the energy of that variable.

In the above equations, ${\beta > 0}$ is an inverse
temperature that controls the "spikiness" of the energy function around each
memory
(the spikier the energy landscape, the more memories can be stored). Each of these three variables is dynamic
(evolves in time).
*The convexity of the Lagrangians ensures that the dynamics of our
network will converge to a fixed point.*

How each variable evolves is dictated by that variable's contribution to
the *global energy function*
${E_\theta(x,y,z)}$
(parameterized by weights ${\theta}$)
that is LOW when the image
${x}$, the label ${y}$, and
the memories ${z}$ are aligned
(look like real data)
and HIGH everywhere else
(thus, our energy function places real-looking data at local energy minima). In this demo we choose an energy function that allows us to manually
insert *memories*
(the `(image,label)` pairs we want to show)
into the weights ${\theta = \left\{\theta^\mathrm{image} \in \mathbb{R}^{N_\mathrm{memories} \times 3N_\mathrm{pixels}},\;\; \theta^\mathrm{label} \in \mathbb{R}^{N_\mathrm{memories} \times N_\mathrm{people}}\right\}}$. As before, let ${\mu = \left\{1,\ldots,N_\mathrm{memories} \right\}}$,
${i = \left\{ 1,\ldots,3N_\mathrm{pixels} \right\}}$ and ${k = \left\{1,\ldots,N_\mathrm{people} \right\}}$. The global energy function in this demo is

$\begin{align}
E_\theta(x, y, z) &= E_x + E_y + E_z + \frac{1}{2} \left[ \sum\limits_\mu z_\mu \sum\limits_i \left(\theta^\mathrm{image}_{\mu i} - x_i\right)^2 - \sum\limits_i x_i^2\right] - \lambda \sum\limits_{\mu} \sum\limits_k z_\mu \theta^\mathrm{label}_{\mu k} y_k\\
&= E_x + E_y + E_z + E_{xz} + E_{yz}
\end{align}$
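The five energy terms can be assembled directly in code. The following is a minimal sketch, not the demo's implementation: all sizes, ${\beta}$, ${\lambda}$, and the random weights are invented for illustration, and the sign conventions follow the update equations given later in this section:

```python
import numpy as np

rng = np.random.default_rng(0)
N_pix, N_people, N_mem = 12, 4, 6   # stand-ins for 3*N_pixels, N_people, N_memories
beta, lam = 2.0, 1.5                # inverse temperature; label weight (lambda > 1)

theta_image = rng.normal(size=(N_mem, N_pix))                       # stored images
theta_label = np.eye(N_people)[rng.integers(N_people, size=N_mem)]  # stored one-hot labels

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def global_energy(x_hat, y_hat, z_hat):
    x, y, z = x_hat, softmax(y_hat), softmax(beta * z_hat)          # activations
    # Node energies (Legendre transforms of the Lagrangians)
    E_x = x_hat @ x - 0.5 * np.sum(x_hat ** 2)
    E_y = y_hat @ y - np.log(np.sum(np.exp(y_hat)))
    E_z = z_hat @ z - np.log(np.sum(np.exp(beta * z_hat))) / beta
    # Edge energies
    sq_dists = np.sum((theta_image - x) ** 2, axis=1)               # shape (N_mem,)
    E_xz = 0.5 * (z @ sq_dists - np.sum(x ** 2))
    E_yz = -lam * z @ (theta_label @ y)
    return E_x + E_y + E_z + E_xz + E_yz

E = global_energy(rng.normal(size=N_pix), np.zeros(N_people), np.zeros(N_mem))
print(np.isfinite(E))  # True
```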

We introduce ${\lambda > 1}$ to encourage the dynamics
to align with the label.

Associative Memories can always be visualized as an undirected graph.

Every associative memory can be understood as an undirected graph where *nodes*
represent dynamic variables and *edges* capture the
(often learnable) alignment between dynamic
variables. Notice that there are five energy terms in this global energy
function: one for each node
(${E_x}$, ${E_y}$, ${E_z}$), and one for each edge
(${E_{xz}}$ captures the alignment between memories
and our image and ${E_{yz}}$ captures the alignment
between memories and our label). See the diagram below for the anatomy of this network.

In our energy function, we have three nodes and two edges:

- **Image node ${x}$** representing the state of our image
- **Label node ${y}$** representing the state of our label
- **Hidden node ${z}$** representing the memories
- **Edge ${(x,z)}$** capturing the alignment of the presented image to our memories
- **Edge ${(y,z)}$** capturing the alignment of the presented label to our memories

We use L2 similarity
${\;\mathrm{L2}(a,b) = - \sum\limits_j (a_j - b_j)^2}$
to capture the alignment of images to the memories stored in
${\theta^\mathrm{image}}$ and cosine similarity
${\;\mathrm{cossim}(a,b) = \sum\limits_j a_j b_j}$ (where ${||a||_1 = ||b||_1 = 1}$)
to capture the alignment of labels to memories stored in ${\theta^\mathrm{label}}$.
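Both similarity measures are one-liners. Note that the "cosine" similarity here is just a dot product that assumes its arguments are L1-normalized (as softmax outputs are); this sketch uses made-up vectors:

```python
import numpy as np

def l2_sim(a, b):
    # L2 similarity: negative squared Euclidean distance (larger = more similar)
    return -np.sum((a - b) ** 2)

def cossim(a, b):
    # plain dot product; assumes a and b are L1-normalized (e.g. softmax outputs)
    return np.sum(a * b)

a = np.array([1.0, 2.0, 2.0])
print(l2_sim(a, a + 1.0))      # -3.0

p = np.array([0.7, 0.2, 0.1])  # a softmax-like distribution
q = np.array([1.0, 0.0, 0.0])  # a one-hot label
print(cossim(p, q))            # 0.7
```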

It is convenient to view ${z}$ as a *similarity score*: at its fixed point, the internal state ${\hat{z}_\mu = -\frac{1}{2}\sum\limits_{i} (\theta^{\mathrm{image}}_{\mu i} - x_i)^2 + \lambda \sum\limits_{k} \theta^\mathrm{label}_{\mu k} y_k}$ captures the similarity of ${(x,y)}$ to the
${\mu}$th memory
(${z_\mu}$ is dynamic in the sense that it evolves
in time as a function of
${x}$ and ${y}$).

This global energy function ${E_\theta(x,y,z)}$ turns
our images, labels, and memories into dynamic variables whose internal states
evolve according to the following differential equations:

$\begin{align*}
\tau_x \frac{d\hat{x}_i}{dt} &= -\frac{\partial E_\theta}{\partial x_i} = \sum\limits_\mu z_\mu \left( \theta^\mathrm{image}_{\mu i} - x_i \right)\\
\tau_y \frac{d\hat{y}_k}{dt} &= -\frac{\partial E_\theta}{\partial y_k} = \lambda \sum\limits_\mu z_\mu \theta^\mathrm{label}_{\mu k} - \hat{y}_k\\
\tau_z \frac{d\hat{z}_\mu}{dt} &= -\frac{\partial E_\theta}{\partial z_\mu} = - \frac{1}{2} \sum\limits_i \left(\theta^\mathrm{image}_{\mu i} - x_i \right)^2 + \lambda \sum\limits_k \theta^\mathrm{label}_{\mu k} y_k - \hat{z}_\mu
\end{align*}$

where ${\tau_x, \tau_y, \tau_z}$ define how quickly
the states evolve.
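These update equations can be integrated with a simple forward-Euler scheme. The following is a sketch under invented assumptions (sizes, ${\beta}$, ${\lambda}$, step size, number of steps, ${\tau = 1}$); the actual demo may integrate differently. Because the dynamics descend the global energy, the state should settle at a fixed point where all three derivatives vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
N_pix, N_people, N_mem = 12, 4, 6
beta, lam = 2.0, 1.5
dt = 0.1                                  # Euler step (dt / tau, with tau = 1)

theta_image = rng.normal(size=(N_mem, N_pix))
theta_label = np.eye(N_people)[rng.integers(N_people, size=N_mem)]

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def step(x_hat, y_hat, z_hat):
    """One Euler step of the internal-state dynamics."""
    x, y, z = x_hat, softmax(y_hat), softmax(beta * z_hat)
    dx = (theta_image - x).T @ z          # sum_mu z_mu (theta_mu - x)
    dy = lam * (theta_label.T @ z) - y_hat
    dz = (-0.5 * np.sum((theta_image - x) ** 2, axis=1)
          + lam * (theta_label @ y) - z_hat)
    return x_hat + dt * dx, y_hat + dt * dy, z_hat + dt * dz

x_hat = rng.normal(size=N_pix)            # noisy initial image
y_hat = np.zeros(N_people)
z_hat = np.zeros(N_mem)
for _ in range(1000):
    x_hat, y_hat, z_hat = step(x_hat, y_hat, z_hat)

# At a fixed point, the time derivative of x vanishes
x, y, z = x_hat, softmax(y_hat), softmax(beta * z_hat)
residual = np.linalg.norm((theta_image - x).T @ z)
print(residual < 1e-2)  # True (the network has converged)
```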

The variables in Associative Memories always seek to minimize their
contribution to a global energy function.

Note that in the demo we treat our network as an
*image generator*
by clamping the labels
(forcing ${\frac{d\hat{y}}{dt} = 0}$). We could similarly use the same Associative Memory as a
*classifier*
by clamping the image
(forcing ${\frac{d\hat{x}}{dt} = 0}$) and allowing only the label to evolve.
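The generator mode can be sketched by simply holding ${\hat{y}}$ fixed and evolving only ${\hat{x}}$ and ${\hat{z}}$. Everything below is an illustrative assumption (sizes, one memory per person, ${\beta}$, step count, and a deliberately large ${\lambda}$ so the clamped label reliably dominates the retrieval), not the demo's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
N_pix, N_mem = 12, 3              # one stored memory per person
beta, lam, dt = 4.0, 20.0, 0.1    # lambda exaggerated so the label always wins

theta_image = rng.normal(size=(N_mem, N_pix))
theta_label = np.eye(N_mem)       # memory mu belongs to person mu

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def generate(person, steps=1000):
    """Clamp the label (d y_hat / dt = 0) and let the image and memories evolve."""
    y = np.eye(N_mem)[person]     # clamped one-hot label activation
    x_hat = np.zeros(N_pix)       # start from a blank image
    z_hat = np.zeros(N_mem)
    for _ in range(steps):
        x, z = x_hat, softmax(beta * z_hat)
        dx = (theta_image - x).T @ z
        dz = (-0.5 * np.sum((theta_image - x) ** 2, axis=1)
              + lam * (theta_label @ y) - z_hat)
        x_hat, z_hat = x_hat + dt * dx, z_hat + dt * dz
    return x_hat

x_gen = generate(person=1)
dists = np.sum((theta_image - x_gen) ** 2, axis=1)
print(int(np.argmin(dists)))  # 1 -- the generated image matches person 1's memory
```

Swapping which variable is clamped (fixing `x_hat` instead of `y` and evolving `y_hat`) turns the same loop into a classifier.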