📙 Chapter 1. Neuron
How can we interpret neurons in deep learning?
In the previous section, we introduced neurons in deep learning by comparing them with neurons in neuroscience, but we only covered neurons with a linear weight and bias. With the same interpretation, we now introduce neurons in several deep learning modules, including the multi-layer perceptron (MLP), the convolutional neural network (CNN), the recurrent neural network (RNN), and generative pretrained transformers (GPT). In addition, we provide a neuron perspective on key-value memory in GPT. We believe this interpretation of neurons can further help in understanding the general computation of deep neural networks.
Let's see the most basic module: the weights of a linear layer. The weights form a matrix whose rows correspond to the output neurons and whose columns correspond to the synapses of these neurons. You can think of each row (neuron) as independently processing the input signal $x$.
Assuming that $W \in \mathbb{R}^{\textcolor{red}{m} \times \textcolor{blue}{n}}$, the layer contains $\textcolor{red}{m}$ neurons, and each neuron has $\textcolor{blue}{n}$ synapses connecting it to the $n$-dimensional input.
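Here is a minimal PyTorch sketch of this row-wise view (the sizes $m = 2$ and $n = 3$ are arbitrary). Note that PyTorch's `nn.Linear` stores its weight as an (out_features $\times$ in_features) matrix, i.e., one row per neuron:

```python
import torch
import torch.nn as nn

# A linear layer with n = 3 synapses per neuron and m = 2 neurons.
layer = nn.Linear(in_features=3, out_features=2)

# PyTorch stores the weight as an (m x n) matrix: one row per neuron.
print(layer.weight.shape)  # torch.Size([2, 3])

x = torch.randn(3)
y = layer(x)

# Each neuron (row) computes its output independently from the input signal x.
neuron_0 = layer.weight[0] @ x + layer.bias[0]
assert torch.allclose(y[0], neuron_0)
```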
An MLP is a series of linear neurons with activations in between. The activations act as thresholds that pass only valid signals: for example, ReLU cuts off negative signals, and Tanh provides a gating mechanism for signals. Therefore, in the interpretation of neurons, we can think of activations as modifiers of the neurons' outputs.
The figure below shows a two-layer MLP. Each layer has two neurons; the neurons in the first layer have three synapses each, while the neurons in the second layer have two synapses each.
The figure below shows a four-layer MLP. Each layer has three neurons, and each neuron has three synapses. Note that all pre-neurons and post-neurons are connected by synapses; we think this is why the MLP is also called a densely connected neural network.
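A minimal PyTorch sketch of this four-layer MLP, with ReLU activations acting as the modifiers between the linear neurons:

```python
import torch
import torch.nn as nn

# Four-layer MLP from the figure: each layer has 3 neurons with 3 synapses,
# so every pre-neuron connects to every post-neuron ("densely connected").
mlp = nn.Sequential(
    nn.Linear(3, 3), nn.ReLU(),
    nn.Linear(3, 3), nn.ReLU(),
    nn.Linear(3, 3), nn.ReLU(),
    nn.Linear(3, 3),
)

x = torch.randn(3)
print(mlp(x).shape)  # torch.Size([3])
```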
A convolutional layer can also be considered a linear layer, since the convolution operation can be reformulated as a linear operation. However, we use another tool to interpret the convolutional neuron. The most common way of describing a convolutional layer uses the notation (out_channels, in_channels, kernel, kernel, ...). With this notation, the out_channels part plays the same role as the rows of the linear neuron, and (in_channels, kernel, kernel) are the synapses of a neuron. Under this interpretation, the only difference from the linear layer is the (kernel, kernel) part, where the convolution operation is applied over the 2D plane. If you assume the kernel size is 1, then you get exactly a linear neuron!
To summarize the counting for a convolutional layer:

- number of neurons: out_channels
- number of synapses per neuron: in_channels $\times$ kernel $\times$ kernel
- total number of synapses: out_channels $\times$ in_channels $\times$ kernel $\times$ kernel
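A small PyTorch sketch of this counting (the layer sizes here are arbitrary):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

# Weight shape follows the (out_channels, in_channels, kernel, kernel) notation.
print(conv.weight.shape)  # torch.Size([16, 3, 5, 5])

num_neurons = conv.out_channels               # 16 neurons, one per output channel
synapses_per_neuron = conv.weight[0].numel()  # 3 * 5 * 5 = 75
total_synapses = conv.weight.numel()          # 16 * 75 = 1200
print(num_neurons, synapses_per_neuron, total_synapses)
```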
Besides the MLP and CNN, the other basic module is the recurrent neural network (RNN), which successively computes over an input using linear operations together with hidden states. Computationally, the recurrent neuron is the same as the linear neuron except for its input and output types. Note that the recurrent neuron produces outputs that are re-injected into the input of that same neuron. Advanced RNN modules such as the LSTM add gating neurons on top of this, but each gate is still a set of linear neurons followed by an activation.
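A minimal sketch of this re-injection using PyTorch's `RNNCell` (the sizes and sequence length are arbitrary):

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=3, hidden_size=4)  # 4 recurrent neurons

h = torch.zeros(1, 4)           # hidden state: the neurons' previous outputs
for x in torch.randn(5, 1, 3):  # a sequence of 5 inputs
    h = cell(x, h)              # outputs are re-injected at the next step

# Each neuron has synapses to the input AND to the previous hidden state:
print(cell.weight_ih.shape)  # torch.Size([4, 3])
print(cell.weight_hh.shape)  # torch.Size([4, 4])
```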
Now we discuss neurons in more advanced architectures such as GPT. GPT consists of a token embedding module, transformer decoder blocks, and task-specific heads such as the LMHead. We examine neurons in a GPT block and leave out the interpretation of the other modules, as they are just linear-like neurons. A GPT block has two main parts, the MLP and self-attention (or cross-attention), together with layer norm.
The MLP block is a two-layer MLP whose first layer is sometimes called the key and whose second layer is sometimes called the value. We use $d_{model}$, the dimension of GPT representations, to describe the numbers of neurons and synapses. The key part is a linear layer with weight \(W_{key} \in \mathbb{R}^{\textcolor{red}{4\cdot d_{model}} \times \textcolor{blue}{d_{model}}}\), whose number of neurons is four times larger than its number of synapses per neuron. On the other hand, the value part is a linear layer with weight \(W_{value} \in \mathbb{R}^{\textcolor{red}{d_{model}} \times \textcolor{blue}{4\cdot d_{model}}}\), whose number of neurons is four times smaller than its number of synapses per neuron. Therefore, this is a bottleneck architecture, and many researchers interpret it as a neural memory.
Consider $d_{model} = 12{,}288$, the hidden dimension of GPT-3 (175B): the key layer then has $49{,}152$ neurons with $12{,}288$ synapses each, while the value layer has $12{,}288$ neurons with $49{,}152$ synapses each.
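A tiny sketch spelling out this arithmetic:

```python
d_model = 12288  # hidden dimension of GPT-3 (175B)

# Key part: W_key is (4*d_model x d_model).
key_neurons, key_synapses_per_neuron = 4 * d_model, d_model
print(key_neurons, key_synapses_per_neuron)      # 49152 12288

# Value part: W_value is (d_model x 4*d_model).
value_neurons, value_synapses_per_neuron = d_model, 4 * d_model
print(value_neurons, value_synapses_per_neuron)  # 12288 49152

# Each part holds 4 * d_model**2 weights in total.
print(key_neurons * key_synapses_per_neuron)     # 603,979,776
```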
In ATTN, there are four linear layers, query, key, value, and output, and each layer has weight \(W \in \mathbb{R}^{\textcolor{green}{d_{model}} \times \textcolor{green}{d_{model}}}\). Therefore, we can easily interpret neurons in ATTN the same way as in a linear layer: each of the four layers has $d_{model}$ neurons with $d_{model}$ synapses per neuron.
Consider $d_{model} = 12{,}288$ again: each of the four ATTN layers then has $12{,}288$ neurons with $12{,}288$ synapses each, or about 151M weights per layer.
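And the corresponding arithmetic for ATTN:

```python
d_model = 12288  # hidden dimension of GPT-3 (175B)

# Query, key, value, and output projections are each (d_model x d_model):
neurons_per_layer = d_model       # 12,288 neurons per projection
synapses_per_neuron = d_model     # 12,288 synapses per neuron
weights_per_layer = d_model ** 2  # 150,994,944 (~151M) weights per projection
print(4 * weights_per_layer)      # 603,979,776 ATTN weights per block
```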
Finally, we count the numbers of neurons and synapses in the GPT-3 (175B) blocks with the above interpretation:

- GPT-3 (175B) number of synapses: 173B
- GPT-3 (175B) number of neurons: 11M
- GPT-3 (175B) number of synapses per neuron: 14,745
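These figures can be reproduced with a rough back-of-the-envelope sketch. We assume 96 blocks (the depth of GPT-3 175B) and roughly $10 \cdot d_{model}$ neurons per block to match the quoted figures; this per-block counting convention is our assumption, as the exact total depends on which sub-modules are counted as neurons:

```python
d_model, n_blocks = 12288, 96  # GPT-3 (175B)

# Synapses per block: 8 * d^2 for the MLP (key + value) + 4 * d^2 for ATTN.
synapses = n_blocks * 12 * d_model ** 2  # 173,946,175,488 (~173B)

# Neurons per block: roughly 10 * d_model (assumed counting convention).
neurons = n_blocks * 10 * d_model        # 11,796,480 (quoted as 11M above)

print(synapses // neurons)               # 14745 synapses per neuron
```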
As humans have about 7,000 synapses per neuron, we can say that GPT-3 has a similar or even larger number of connections per neuron. However, humans have far more neurons than GPT. We believe such a difference can affect the general learning process. For example, if the number of neurons is large, we can see them as an ensemble of simple pieces of knowledge; on the other hand, if the number of synapses is large, expert neurons will provide the knowledge.
As we think more about deep neural networks through the metaphor of the human brain, the computational graph becomes highly interpretable, and the individual components handle very simple computations, like neurons in our brain. Recent work in interpretability focuses on discovering the meanings of neurons, and work like this can shed light on the interpretation of deep neural networks. We believe that a still deeper dissection of neural networks can give a better interpretation of them.
Biological neurons communicate by receiving and transmitting excitatory spikes. Having received multiple spikes, a neuron is more likely to produce multiple spikes of its own. However, unlike in deep learning, biological neurons do not receive spikes simultaneously. Moreover, firing rates may coincide and synchronize across multiple neurons, something that has no counterpart in deep learning.
Spikes are not all equal: after a spike reaches the axon terminal, the amount of electrical energy passed on strongly depends on the strength of the synapse. Therefore, given the same output signals from preceding neurons, the received signals may differ. Analogously, the strength of a signal in deep learning is usually represented by the weight of the synapse.
If biological neurons are examined closely, it becomes evident that there are two main classes of neurotransmitters: excitatory ones, which make the receiving neuron more likely to fire, and inhibitory ones, which make it less likely to fire.
Deep learning relies on rate-based coding, in which each neuron’s activation is a single numeric value that models the average spiking rate in response to a given stimulus (be it from other neurons or from an external stimulus). The collective set of spiking rate values within a single layer of the network is typically organized as a vector of numbers, and this vector is referred to as the representation of an external stimulus, at that layer.
The expressiveness of rate-based neural coding is much lower than what is possible with neural codes (representations) based on the relative timing of spikes across multiple neurons. As a simple example of the existence of this type of code in biology, consider the auditory system. When sound waves reach our ears, our brain processes them to determine the type of animal, object, or phenomenon that produced the sound, and also to estimate the direction from which the sound came (localization). One way in which sound location is determined is based on the fact that sounds from the right will reach the right ear first, then the left ear. Auditory neurons close to the right and left ears exhibit spike timing that reflects this acoustic timing difference. Auditory neurons that are more medially located (near the midline of the body) receive input from neurons near both ears and are selective for the location of the sound, thanks to this temporal coding.
More generically, consider the simple example of a single neuron receiving input from two other neurons, each of which sends identical input: a short train of N uniformly spaced (in time) excitatory spikes over 100 ms. All else being equal, this will generate some stereotypical response in the receiving neuron. In contrast, if one of the input neurons sent all of its spikes in the first 20 ms (of the 100 ms interval), and the other input neuron sent all of its spikes in the final 20 ms, the response of the receiving neuron is likely to be notably different. Thus, even though the spiking rate of the input neurons is identical in each scenario (10N spikes/sec), the temporal coding is quite different, and the response of the receiving neuron can be quite different as well. Importantly, there can exist many input-output combinations when using a temporal code, even when the number of input spikes is low, constant, or both. This is what we mean by a more expressive coding scheme. With regard to AI, a model that utilizes temporal coding can conceivably perform much more complex tasks than a deep learning model with the same number of neurons.
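To make this concrete, here is a toy simulation of the receiving neuron, modeled as a leaky integrate-and-fire unit (the model and all of its parameters are hypothetical choices for illustration, not part of the original argument):

```python
import numpy as np

def lif_output_spikes(spike_times_ms, tau=20.0, threshold=3.0, t_max=100):
    """Toy leaky integrate-and-fire neuron: count its output spikes."""
    counts = np.zeros(t_max)  # input spikes per 1 ms bin
    for t in spike_times_ms:
        counts[int(t)] += 1.0
    v, n_out = 0.0, 0
    for c in counts:
        v = v * np.exp(-1.0 / tau) + c  # leak, then add incoming spikes
        if v >= threshold:              # fire and reset
            n_out += 1
            v = 0.0
    return n_out

N = 5  # spikes per input neuron over 100 ms (same rate in both scenarios)
# Scenario A: both input neurons spike uniformly over the 100 ms window.
uniform = list(np.linspace(0, 99, N)) * 2
# Scenario B: one input fires in the first 20 ms, the other in the last 20 ms.
bursty = list(np.linspace(0, 19, N)) + list(np.linspace(80, 99, N))

print(lif_output_spikes(uniform))  # 0 output spikes
print(lif_output_spikes(bursty))   # 2 output spikes
```

With these particular parameters, the uniform trains never push the neuron over its threshold, while the bursty trains do, despite the identical spike counts.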
Based on our simple description of biological and deep learning neurons, the distinction between excitatory and inhibitory neurons can be mimicked by deep learning neurons. Namely, one can mimic a biological inhibitory neuron simply by ensuring that its deep learning equivalent has negative values for all synaptic weights between its axons and the dendrites of the neurons to which it projects. Conversely, when mimicking a biological excitatory neuron, such synapses should always have positive weights. However, training and implementation become easier if one simply requires that all synapses be positive-valued (perhaps by applying a ReLU function to the weights after each training iteration) and uses an activation function that produces negative (positive) values for the inhibitory (excitatory) neurons.
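As a minimal sketch of that simpler scheme (the network sizes and the sign assignment are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fixed signs for the 4 hidden neurons: +1 = excitatory, -1 = inhibitory.
signs = torch.tensor([1.0, 1.0, 1.0, -1.0])

hidden = nn.Linear(3, 4)
readout = nn.Linear(4, 2)
optimizer = torch.optim.SGD([*hidden.parameters(), *readout.parameters()], lr=0.01)

def forward(x):
    # ReLU makes every neuron's magnitude non-negative; the fixed sign then
    # makes inhibitory neurons emit negative values and excitatory ones positive.
    return readout(signs * torch.relu(hidden(x)))

x, target = torch.randn(8, 3), torch.randn(8, 2)
loss = F.mse_loss(forward(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After each training iteration, clamp all synapses to be non-negative.
with torch.no_grad():
    hidden.weight.clamp_(min=0)
    readout.weight.clamp_(min=0)
```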
It’s not certain, but one possibility is that the use of explicit inhibitory neurons helps to constrain the overall parameter space while allowing for the evolution or development of sub-network structures that promote fast learning. In biological systems, it is not necessary for the brain to be able to learn any input-output relationship, or execute any possible spiking sequence. We live in a world with fixed physical laws, with species whose individual members share many common within-species behavioral traits that need not be explicitly learned. Limiting the possible circuit connectivity and dynamic activity of a network is tantamount to limiting the solution space over which a training method must search.
Another potential benefit of inhibitory neurons, related to the use of structured canonical circuits as just mentioned, is that inhibitory neurons may effectively “gate off” large numbers of neurons that are unnecessary for processing of a given sample or task, thereby lowering energy requirements.
The energy efficiency of biological neurons, relative to conventional computing hardware, is largely due to two characteristics of these neurons. Firstly, biological neurons transmit only short bursts of analog energy (spikes) rather than maintaining many bits that represent a single floating-point or integer number. In conventional hardware, these bits require persistent energy flow to maintain the 0 or 1 state, unless much slower types of memory are used (non-volatile RAM).
Secondly, memory is co-located with processing in biological neurons. That is, the synaptic strengths are the long-term memory of the network (recurrent connectivity can maintain short-term memory, but that’s a topic for some other post), they are involved in the processing (spike weighting and transmission), and are very close to other aspects of the processing (energy accumulation in the cell body). In contrast, conventional hardware is regularly transmitting bits from RAM to the processor — a considerable distance and a considerable energy drain.
In the MLP section, we discussed the key-value neurons. In this section, we provide a more detailed neuron perspective on these modules. The figure below shows the key-value computation with sizes 3 for the input, 8 for the key neurons, and 2 for the value neurons.
If we assume that a neuron provides just a single output, we can interpret each neuron in a row-wise manner. That is, we do not think about the meaning of the individual weights and only consider the output value of each neuron. The figure below shows this neuron-wise interpretation.
Unlike the neuron-wise interpretation, where we interpret each neuron separately, we usually interpret multiple neurons together by considering their directional vectors. In this case, the weights of the value neurons are not just used for the aggregation of synapses but can be interpreted as directionally meaningful information. The figure below shows an example; here, the key activations are used to scale the vectors.
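A small numeric sketch of this directional view, using the sizes from the figure (3 input, 8 key, and 2 value neurons): the output is a sum of the value matrix's columns (directions), each scaled by the corresponding key activation.

```python
import torch

d_in, d_key, d_value = 3, 8, 2         # sizes from the figure

W_key = torch.randn(d_key, d_in)       # 8 key neurons, 3 synapses each
W_value = torch.randn(d_value, d_key)  # 2 value neurons, 8 synapses each

x = torch.randn(d_in)
k = torch.relu(W_key @ x)              # key activations act as coefficients
y = W_value @ k                        # neuron-wise (row-wise) computation

# Directional view: each *column* of W_value is a vector in output space,
# scaled by its key activation and summed into the final output.
y_directional = sum(k[i] * W_value[:, i] for i in range(d_key))
assert torch.allclose(y, y_directional)
```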