## Preamble

You might ask: “In a neural network, what is ‘memory’?”  That is, when a neural network learns a task, how does it remember that task? In essence, the answer is that memory forms during the training of the learnable parameters; the matrix of trained weights is the memory:

“Every machine learning model has two components: the structure of the model or the algorithm, and the model’s parameters. In the simplest case of a linear regression model the algorithm is just a linear equation. The parameters are the weights of the dependent variables. In a deep learning model, the structure is how the nodes are connected. The parameters are the weights of the inter-connections etc. The learning of the ML algorithm is stored in the parameters [i.e., the trained parameters are the “memory”]. The process of training an ML model from the training data fine-tunes the parameters.” [Source: How Does a Machine Learning System Store Its Learned Memory?]

[Also, saving those models as pretrained models enables you to “permanently” store those models (memories), for later use in transfer learning tasks! ]

To explain how memory evolves in neural networks, I’ll superficially describe two neural architectures that are widely used in sequence-based and temporal applications (such as natural language processing): simple recurrent neural networks (RNN), and a specialized version of RNN, long short-term memory (LSTM). A description of these architectures provides a good explanation of the concept of neural memory.

While the descriptions of these models vary somewhat depending on the source (arXiv papers, the web, etc.), the overall concepts are rather simple once you grasp them. Here, I’ve selected and integrated bits and pieces from various sources (with some overlap), but I think the benefit of examining a few of these examples, which clarifies the overall concepts of RNN and LSTM (and consequently “neural memory”), outweighs the repetition. While there is some math, the focus is on understanding rather than implementation. I draw on several sources, so the sum of the material below should provide a satisfactory explanation.

## Neural Network Basics

Artificial neural networks (ANN) are composed of layers of connected units called artificial neurons. A “shallow network” refers to an ANN with one input layer, one output layer, and at most one hidden layer, without a recurrent connection. These networks are generally made from simple, nonlinear units, where the higher layers provide a more abstract representation of the data and suppress unwanted variability.

[image: ANN showing bias units]

Each of the circles above represents an individual node (different layers may contain different numbers of nodes). In the feedforward pass, each node is acted upon (numerically modified) by the activation function $\small \phi$, shown in the images below [and later adjusted, in response to the error, during the backpropagation step, not discussed here]. We calculate each of the layer $\small 2$ node activations from the sum of the input values ($\small \mathbf{x}_1, \mathbf{x}_2, \dots$) times the weight matrix $\small \mathbf{W}_{ij}$, plus the bias term $\small \mathbf{b}$. For example, the value of node $\small \mathbf{a}_3^{(2)}$ in the image above (representing unit $\small 3$ in layer $\small 2$) is calculated from the values of each of the input nodes $\small \mathbf{x}_i$ in the preceding layer, the weight matrices $\small \mathbf{W}_{ij}$, and the bias unit $\small \mathbf{b}$ ($\small \mathbf{x}_0 = +1$), giving the layer $\small 2$ node activations (activation values) $\small \mathbf{a}_i^{(2)}$.

The activation function is shown in greater detail in the following two images; as mentioned, it acts on each node. The activation values on each of the hidden units/nodes in layer $\small 3$ ($\small \mathbf{a}_i^{(3)}$) are derived from the sigmoid function ($\small \phi(\cdot)$) applied to the linear combination (sum, $\small \sum$) of the inputs to those nodes (the layer $\small 2$ activations multiplied by the weight matrix, plus the bias unit). The activations are derived using exactly the same logic as for the first hidden layer, except that the input values are the activation values $\small \mathbf{a}$ from the preceding layer, rather than the $\small \mathbf{x}$ values. …

Note that the summations, $\small \sum(\cdot)$, shown below are normally and very neatly accomplished in one step by multiplying the inputs by a transposed weight matrix (that also includes the bias term). For example, given a neuron that takes as input $\small \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and a bias term $\small \mathbf{b} = 1$, the output would be $\small \mathbf{h}_{W,b}(\mathbf{x}) = \phi(\sum_{i=1}^{3}\mathbf{W}_i\mathbf{x}_i + b) = \phi(\mathbf{W}^T\mathbf{x})$. Note the last expression, $\small \phi(\mathbf{W}^T\mathbf{x})$: the transposed weight matrix (including the bias) is multiplied by the input, giving the output in one operation!
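As a concrete illustration, here is that one-step computation as a NumPy sketch: it compares the explicit weighted sum against the single matrix operation with the bias folded in. All input and weight values here are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Sketch of a single neuron's feedforward computation, with made-up values.
# Folding the bias into the weight vector (via a constant input x_0 = +1)
# reproduces phi(W^T x) in one operation.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])      # inputs x_1, x_2, x_3 (illustrative)
W = np.array([0.2, 0.4, -0.1])      # weights W_1, W_2, W_3 (illustrative)
b = 1.0                             # bias term

# Explicit form: phi(sum_i W_i x_i + b)
h_stepwise = sigmoid(np.sum(W * x) + b)

# One-shot form: augment x with x_0 = +1 and fold b into the weight vector
x_aug = np.concatenate(([1.0], x))
W_aug = np.concatenate(([b], W))
h_oneshot = sigmoid(W_aug @ x_aug)

assert np.isclose(h_stepwise, h_oneshot)
```

Both forms give the same activation; stacking many such neurons simply turns `W_aug` into the transposed weight matrix acting on the whole layer at once.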

[image source; $\small \text{output} = \text{activation}(\sum(\text{weight} * \text{input}) + \text{bias})$]

ANNs with recurrent connections are called recurrent neural networks (RNN), which are capable of modelling sequential data for sequence recognition and prediction. RNNs are made of high-dimensional hidden states with nonlinear dynamics. The hidden states work as the memory of the network: the state of the hidden layer at a given time is conditioned on its previous state. This structure enables the RNN to store, remember, and process past complex signals for long time periods. RNNs can map an input sequence to the output sequence at the current time step and predict the sequence in the next time step. [Source: Recent Advances in Recurrent Neural Networks]

## Recurrent Neural Networks

[image: folded and unfolded RNN]

At each time step $\small t$, an RNN reads an input vector $\small \mathbf{x}_t$ into a hidden state vector $\small \mathbf{h}_t$ and predicts an output vector $\small \mathbf{y}_t$ [shown as $\small \mathbf{o}_t$ in the diagram above]. The state dynamic can be abstracted as a recurrence relation: $\small \mathbf{h}_t = RNN(\mathbf{h}_{t-1}, \mathbf{x}_t)$. The vanilla RNN is parameterized as follows:

$\ \ \ \ \ \ \small \mathbf{h}_t = \sigma (W_h \mathbf{h}_{t-1} + V\mathbf{x}_t + \mathbf{b}_h)$

$\ \ \ \ \ \ \small \mathbf{y}_t = W_y \mathbf{h}_t + \mathbf{b}_y$

where $\small (W_h, W_y, V, \mathbf{b}_h, \mathbf{b}_y)$ are learnable parameters, and $\small \sigma$ is a point-wise nonlinear function.
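As a sketch, a single time step of this vanilla RNN can be written in NumPy as follows; the layer sizes, the random parameter values, and the choice of tanh for the point-wise nonlinearity $\small \sigma$ are all illustrative assumptions:

```python
import numpy as np

# Sketch of one vanilla-RNN time step:
#   h_t = sigma(W_h h_{t-1} + V x_t + b_h)
#   y_t = W_y h_t + b_y
# Layer sizes, random parameters, and sigma = tanh are illustrative.

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2

W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights (shared over time)
V   = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W_y = rng.normal(size=(n_out, n_hidden))     # hidden-to-output weights
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_h @ h_prev + V @ x_t + b_h)  # point-wise nonlinearity sigma
    y_t = W_y @ h_t + b_y
    return h_t, y_t

h = np.zeros(n_hidden)        # initial hidden state
x_t = rng.normal(size=n_in)
h, y = rnn_step(h, x_t)
```

The same `W_h`, `V`, and `W_y` would be reused at every time step; only `h` changes as the sequence is consumed.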

Andrej Karpathy provides a great example of an RNN that should make it easier to visualize the workings and use of an RNN:


One last thing to note: the weights of the connections between time steps are shared, i.e. there isn’t a different set of weights for each time step. This is discussed in Why are the weights of RNN/LSTM networks shared across time?:

“The ‘shared weights’ perspective comes from thinking about RNNs as feedforward networks unrolled across time. If the weights were different at each moment in time, this would just be a feedforward network [ANN, above]. But, I suppose another way to think about it would be as an RNN whose weights are a time-varying function (and that could let you keep the ability to process variable length sequences). If you did this, the number of parameters would grow linearly with the number of time steps. That would be a big explosion of parameters for sequences of any appreciable length. …”
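The parameter explosion the quote describes is easy to quantify with back-of-the-envelope arithmetic; the layer sizes below are illustrative assumptions:

```python
# With shared weights, a vanilla RNN's parameter count is independent of the
# sequence length T; with a separate weight set per time step it grows
# linearly in T. Layer sizes here are illustrative.

n_in, n_hidden = 100, 256
params_per_step = n_hidden * n_hidden + n_hidden * n_in + n_hidden  # W_h, V, b_h

for T in (10, 100, 1000):
    shared = params_per_step        # one set of weights, reused at every step
    unshared = params_per_step * T  # a different set of weights per step
    print(f"T={T:5d}  shared={shared:,}  unshared={unshared:,}")
```

Weight sharing also means the network can, in principle, process sequences of any length with the same fixed set of parameters.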

## Long Short-Term Memory RNN

Although theoretically powerful, vanilla RNNs cannot learn from long sequences due to a problem known as vanishing or exploding gradients. A powerful solution is long short-term memory (LSTM), introduced in 1997 by Hochreiter & Schmidhuber. Why “long short-term”? As described in that paper, “[LSTM] can learn to bridge time intervals in excess of 1000 steps even in the case of noisy, incompressible input sequences, without loss of short time lag capabilities.” A vanilla RNN relies only on the recent past; long-term past inputs are forgotten. LSTM extends this short-term memory: an LSTM block has mechanisms that enable “memorizing” information for an extended number of time steps.

LSTM introduces one more vector called “memory” $\small \mathbf{c}_t$, which, together with the state $\small \mathbf{h}_t$, specify the dynamic as: $\small (\mathbf{h}_t, \mathbf{c}_t) = LSTM(\mathbf{h}_{t-1}, \mathbf{c}_{t-1}, \mathbf{x}_t)$. In most implementations, this is decomposed further as:

$\ \ \ \ \ \ \small \mathbf{c}_t = \mathbf{f}_t ∗ \mathbf{c}_{t-1} + \mathbf{i}_t ∗ \mathbf{\widetilde{c}}_t$

$\ \ \ \ \ \ \small \mathbf{h}_t = \mathbf{o}_t ∗ tanh(\mathbf{c}_t)$

where

• $\small \mathbf{\widetilde{c}}_t$ is a candidate memory computed from the input
• $\small \mathbf{f}_t, \mathbf{i}_t, \mathbf{o}_t \in (\textbf{0},\textbf{1})$ are gates
• $∗$ denotes point-wise multiplication

• $\small \mathbf{f}_t$ determines how much the previous memory is maintained
• $\small \mathbf{i}_t$ controls how much new information is stored into memory
• $\small \mathbf{o}_t$ controls how much memory is read out

The candidate memory and the gates are typically parameterized (sometimes written as “parametrized”) as:

$\ \ \ \ \ \ \small \mathbf{\widetilde{c}}_t = tanh(W_c\mathbf{h}_{t-1} + V_c\mathbf{x_t} + \mathbf{b_c})$

$\ \ \ \ \ \ \small \begin{bmatrix} f_t \\ i_t \\ o_t \end{bmatrix} = sigm \left( \begin{bmatrix} W_f \\ W_i \\ W_o \end{bmatrix} \mathbf{h}_{t-1} + \begin{bmatrix} V_f \\ V_i \\ V_o \end{bmatrix} \mathbf{x}_t + \begin{bmatrix} \mathbf{b}_f \\ \mathbf{b}_i \\ \mathbf{b}_o \end{bmatrix} \right)$

where $\small (W_{c,f,i,o}, V_{c,f,i,o}, \mathbf{b}_{c,f,i,o})$ are learnable parameters.
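A single LSTM step following the equations above can be sketched in NumPy like this; the sizes and random parameter values are illustrative assumptions, with the three gates computed as one stacked matrix operation, as in the block form above:

```python
import numpy as np

# Sketch of one LSTM step, following the equations above:
#   [f; i; o] = sigm(W h_{t-1} + V x_t + b)   (three gates, stacked)
#   c~_t = tanh(W_c h_{t-1} + V_c x_t + b_c)  (candidate memory)
#   c_t  = f * c_{t-1} + i * c~_t             (keep old memory + write new)
#   h_t  = o * tanh(c_t)                      (read memory out)
# Sizes and random parameter values are illustrative.

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_gates = rng.normal(size=(3 * n_hidden, n_hidden))  # [W_f; W_i; W_o]
V_gates = rng.normal(size=(3 * n_hidden, n_in))      # [V_f; V_i; V_o]
b_gates = np.zeros(3 * n_hidden)
W_c = rng.normal(size=(n_hidden, n_hidden))
V_c = rng.normal(size=(n_hidden, n_in))
b_c = np.zeros(n_hidden)

def lstm_step(h_prev, c_prev, x_t):
    gates = sigmoid(W_gates @ h_prev + V_gates @ x_t + b_gates)
    f, i, o = np.split(gates, 3)                       # each gate lies in (0, 1)
    c_tilde = np.tanh(W_c @ h_prev + V_c @ x_t + b_c)  # candidate memory
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
x_t = rng.normal(size=n_in)
h, c = lstm_step(h, c, x_t)
```

Note that only `h` and `c` change from step to step; the gate parameters, once trained, are the fixed “memory” of the model in the sense discussed in the preamble.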


In the transformations above, the memory cell $\small \mathbf{c}_t$ stores the “long-term” memory in vector form. In other words, the information cumulatively captured and encoded until time step $\small t$ is stored in $\small \mathbf{c}_t$ and is only passed along the same layer over different time steps. Given the inputs $\small \mathbf{c}_{t-1}$ and $\small \mathbf{h}_{t-1}$, the input gate $\small \mathbf{i}_t$ and forget gate $\small \mathbf{f}_t$ help the memory cell decide how to overwrite or keep the memory information. The output gate $\small \mathbf{o}_t$ further lets the LSTM block decide how to retrieve the memory information to generate the current state $\small \mathbf{h}_t$, which is passed to both the next layer of the current time step and the next time step of the current layer. Such decisions are made using the hidden-layer parameters $\small \mathbf{W}$ and $\small \mathbf{b}$ (with various subscripts); these parameters are inferred during the training phase.

Summary:


• Symbols in bold font, above, are vectors (e.g. $\small \mathbf{y}_t$ is an output vector); the weight matrices and bias terms are the network’s learned parameters (i.e., its “memory”).

• That paper glosses over some of the terms (I chose it for the clarity of its “vanilla” (basic) RNN and LSTM structures and descriptions).

• Bias terms ($\small \textbf{b}$ or $\small \textbf{w}_0$) are additional constants attached to the neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin. For example, if all your features were 0, would your output also be zero? Is it possible there is some base value upon which your features have an effect? Bias terms typically accompany weights and must also be learned by your model. [Source: ML Cheatsheet]

Bias is the term $\small \textbf{b}$ in the following formula: $\small y^{\ \prime} = \textbf{b} + \mathbf{w}_1\mathbf{x}_1 + \mathbf{w}_2\mathbf{x}_2 + \ldots + \mathbf{w}_n\mathbf{x}_n$.

• $\small W$, $\small U$ and $\small V$ are matrices of weights, updated during training. $\small W$ are the (shared; see note above) weights passed from one time step to the next; $\small U$ are the “skip connections” from the inputs to all hidden layers, and $\small V$ are the skip connections from the hidden layers to the outputs. The image below, taken from Alex Graves’ paper, illustrates those connections.

• Early versions of simple RNN used the sigmoid activation function (see Recent Advances in Recurrent Neural Networks for a description plus other forms), which is applied component-wise:

$\ \ \ \ \ \ \ \ \ \ \sigma(x) = \frac{1}{1 + e^{-x}}$

The “sigmoid” is a common choice: it takes a real value and squashes it to the range $[0,1]$. This activation function is normally used in the output layer, where a cross-entropy loss function is used for training a classification model.

• In a LSTM, the input gate is a sigmoid function with a range of $[0,1]$. Because the equation of the cell state is a summation involving the previous cell state, a sigmoid alone would only add memory and not be able to remove/forget memory: if you can only add a float number in $[0,1]$, that number will never be zero/turned off/forgotten. This is why the input modulation gate has a tanh activation function: tanh has a range of $[-1, 1]$ and allows the cell state to forget memory. [Source: Long Short-Term Memory (LSTM): Concept]

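The gate ranges described above are easy to verify numerically:

```python
import numpy as np

# Numeric check of the gate ranges: sigmoid outputs lie in (0, 1), so a
# sigmoid alone can only scale memory down, never negate it; tanh outputs
# lie in (-1, 1), so the candidate memory can push the cell state either way.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10.0, 10.0, 1001)
s = sigmoid(z)
t = np.tanh(z)

assert np.all((s > 0.0) & (s < 1.0))   # always positive: cannot cancel memory
assert np.all((t > -1.0) & (t < 1.0))  # can be negative: can cancel memory
```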

Alex Graves shows a variation of an RNN, above, that better illustrates some additional key concepts (Generating Sequences with Recurrent Neural Networks).

• “Fig. 1 illustrates the basic recurrent neural network prediction architecture used in this paper. An input vector sequence $\small \mathbf{x} = (x_1, \dots, x_T)$ is passed through weighted connections to a stack of $\small N$ recurrently connected hidden layers to compute first the hidden vector sequences $\small \mathbf{h}^n = (h_1^n, \dots, h_T^n)$ and then the output vector sequence $\small \mathbf{y} = (y_1, \dots, y_T)$. Each output vector $\small y_t$ is used to parameterise a predictive distribution $\small \text{Pr}(x_{t+1} \vert y_t)$ over the possible next inputs $\small x_{t+1}$. The first element $\small x_1$ of every input sequence is always a null vector whose entries are all zero; the network therefore emits a prediction for $\small x_2$, the first real input, with no prior information. The network is “deep” in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities.

“Note the ‘skip connections’ from the inputs to all hidden layers, and from all hidden layers to the outputs. These make it easier to train deep networks, by reducing the number of processing steps between the bottom of the network and the top, and thereby mitigating the “vanishing gradient” problem. In the special case that $\small N = 1$ the architecture reduces to an ordinary, single layer next step prediction RNN.”


From another perspective/source (Footnote 2 in A Beginner’s Guide to LSTMs):

All neural networks whose parameters have been optimized have memory in a sense, because those parameters are the traces of past data. In feedforward networks, that memory may be frozen in time; that is, after a network is trained, the model it learns may be applied to more data without further retraining. It is monolithic in the sense that the same memory (i.e. set of weights) is applied to all incoming data. Recurrent neural networks (RNN) are distinguished from feedforward networks by giving particular weight to events that occur in a series. While those events do not need to follow each other immediately, they are presumed to be linked, however remotely, by the same temporal thread. Feedforward networks do not make such a presumption; they treat the world as a bucket of objects without order or time.

The following material is excerpted from the main body of that article.

“In the case of feedforward networks, input examples are fed to the network and transformed into an output; with supervised learning, the output would be a label, a name applied to the input. That is, they map raw data to categories, recognizing patterns that may signal, for example, that an input image should be labeled “cat” or “elephant.” A feedforward network is trained, for example, on labeled images until it minimizes the error it makes when guessing their categories. The trained set of parameters (or weights) are collectively known as a model. A feedforward network has no notion of temporal order (“time”): the only input it considers is the current example upon which it is working.

“RNN recognize patterns in sequences of data such as text, genomes, handwriting, the spoken word, numerical times series data emanating from sensors, stock markets and government agencies. RNNs are even applicable to images, which can be decomposed into a series of patches and treated as a sequence. RNN take as their input not just the current input example they see, but also what they have perceived previously in time. Since these algorithms take time and sequence into account, they have a temporal dimension. The decision a recurrent net reached at time step $\small t-1$ affects the decision it will reach one moment later at time step $\small t$. RNN therefore have two sources of input, the present and the immediate past, which combine to determine how they respond to new data.

“Recurrent networks are distinguished from feedforward networks by a feedback loop connected to their past decisions, ingesting their own outputs moment after moment as input. It is often said that recurrent networks have memory. That sequential information is preserved in the recurrent network’s hidden state, which manages to span many time steps as it cascades forward to affect the processing of each new example. Mathematically, the process of carrying the memory forward is:

$\ \ \ \ \small \mathbf{h}_t = \phi(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1})$

“The hidden state at time step $\small t$ is $\small h_t$. It is a function of the input at the same time step $\small x_t$, modified by a weight matrix $\small W$ (like the one we used for feedforward nets) added to the hidden state of the previous time step $\small h_{t-1}$ multiplied by its own hidden-state-to-hidden-state matrix $\small U$, otherwise known as a transition matrix and similar to a Markov chain. The weight matrices are filters that determine how much importance to accord to both the present input and the past hidden state. The error they generate will return via backpropagation and be used to adjust their weights until error can’t go any lower.

“The sum of the weight input and hidden state is squashed by the function $\small \phi$ – either a logistic sigmoid function or tanh, depending – which is a standard tool for condensing very large or very small values into a logistic space, as well as making gradients workable for backpropagation. Because this feedback loop occurs at every time step in the series, each hidden state contains traces not only of the previous hidden state, but also of all those that preceded $\small h_{t-1}$ for as long as memory can persist.”
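The recurrence $\small \mathbf{h}_t = \phi(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1})$ can be sketched over a whole sequence to show the memory being carried forward; the sizes and random values below are illustrative assumptions:

```python
import numpy as np

# Sketch: the same W and U are applied at every time step, and each new
# hidden state folds in the previous one, so the final state carries traces
# of every earlier input. Sizes and random values are illustrative.

rng = np.random.default_rng(0)
n_in, n_hidden = 2, 3
W = rng.normal(scale=0.5, size=(n_hidden, n_in))      # input weight matrix
U = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # transition matrix

def run(sequence):
    h = np.zeros(n_hidden)            # initial hidden state
    for x_t in sequence:
        h = np.tanh(W @ x_t + U @ h)  # phi = tanh here
    return h

seq = rng.normal(size=(5, n_in))
perturbed = seq.copy()
perturbed[0] += 1.0                   # change only the FIRST input

# The final hidden states differ: the network "remembers" the early input.
final_a, final_b = run(seq), run(perturbed)
```

Changing only the first input changes the final hidden state, which is exactly the “traces of all those that preceded” described in the quote.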

The most basic or “vanilla” RNN consists of the following set of equations (indexed by time step $t$), which appear in various forms in the literature and the web:

$\ \ \ \ \ \ \small \mathbf{h}_t = \sigma (\mathbf{U}_h \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1})$

$\ \ \ \ \ \ \small \mathbf{y}_t = \mathbf{O} \mathbf{h}_t$

where:

• $\small \mathbf{x}_t \in \mathbb{R}^n$ is the input of the RNN;

• $\small \mathbf{h}_t \in \mathbb{R}^k$ is called the hidden state of the RNN, and acts as a memory of the current state of the network. When starting a sequence, it is set to the all-zero vector ($\small \mathbf{h}_{-1} = 0$).

• $\small \mathbf{y}_t \in \mathbb{R}^p$ is the output of the RNN;

• The logistic function (here, the sigmoid function; see Recent Advances in Recurrent Neural Networks for other forms), applied component-wise:

$\ \ \ \ \ \ \small \sigma(x) = \frac{1}{1 + e^{-x}}$

• $\small \mathbf{U}_{h}, \mathbf{W}_{h}, \mathbf{O}$ are the network’s parameters. [Note that “$\small \mathbf{O}$”, here, is “$\small \mathbf{V}$” elsewhere in this post; i.e. the weight matrix of the “skip connections from the hidden layers to the outputs”.]

The output of such a neural network depends on both the input $\small \mathbf{x}_t$ and the hidden state $\small \mathbf{h}_t$.

Aside:

• Source for the above: An Intrinsic Difference Between Vanilla RNNs and GRU Models plus my own notes (below).

• These are undefined in that paper; I believe that I interpret/describe them correctly:

• The superscripts in $\small \mathbb{R}^\color{Brown}{\mathbf{n}}$, $\small \mathbb{R}^\color{Brown}{\mathbf{k}}$ and $\small \mathbb{R}^\color{Brown}{\mathbf{p}}$ refer to the dimensions of the vectors of real numbers in (respectively) the input, hidden and output layers.

Similarly (elsewhere), $\small \mathbb{R}^{m \times n}$ indicates a matrix of dimensions $\small m\times n$.

• $\small \in$ denotes “element of” (set membership: “is a member of”).

• In mathematics, the set of real numbers ($\small \mathbb{R}$) are the values of a continuous quantity that can represent a distance along a line; they include rational numbers ($\small \mathbb{Q}$), integers ($\small \mathbb{Z}$), and natural numbers ($\small \mathbb{N}$).

• Elements of $\small \mathbb{R}^n$ are vectors. In other words, we can consider each element of $\small \mathbb{R}^n$ (the tuple of $n$ real numbers) to be a vector. Polynomials are more abstract than $\small \mathbb{R}^n$; for example,

$\ \ \ \ \ \ \small a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3$

is an example of a triplet of numbers. Adding two vectors $\small a, b \in \mathbb{R}^n$ component-wise results in another vector: $\small a + b = c \in \mathbb{R}^n$. Moreover, multiplying $\small a \in \mathbb{R}^n$ by $\small \lambda \in \mathbb{R}$ results in a scaled vector $\small \lambda a \in \mathbb{R}^n$. Linear algebra focuses on the similarities between these vector concepts; we can add them together, and multiply them by scalars. We largely focus on vectors in $\small \mathbb{R}^n$ since most algorithms in linear algebra are formulated in $\small \mathbb{R}^n$. Recall that in machine learning, we often consider data to be represented as vectors in $\small \mathbb{R}^n$. [Source: Linear Algebra]
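Concretely, those two closure properties of $\small \mathbb{R}^n$ look like this in NumPy:

```python
import numpy as np

# Component-wise addition and scalar multiplication in R^3:
# both operations return another element of R^3.
a = np.array([1.0, 2.0, 3.0])   # a in R^3
b = np.array([4.0, 5.0, 6.0])   # b in R^3
lam = 2.0                       # lambda in R

c = a + b          # component-wise sum: [5., 7., 9.]
scaled = lam * a   # scaled vector: [2., 4., 6.]
```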

• In mathematics, the real coordinate space of $n$ dimensions, written $\small \mathbb{R}^n$ is a coordinate space that allows several ($\small n$) real variables to be treated as a single variable. With various numbers of dimensions (sometimes unspecified), $\small \mathbb{R}^n$ is used in many areas of pure and applied mathematics, as well as in physics. With component-wise addition and scalar multiplication, it is the prototypical real vector space and is a frequently used representation of Euclidean $\small n$-space. An element of $\small \mathbb{R}^n$ is written $\small x = (x_1, x_2, \ldots, x_n)$, where each $\small x_i$ is a real number.

### Summary of Sources

… more or less in order of citation: