Review of Explicit Memory for Language Modeling ($\text{Memory}^3$)

$\text{Memory}^3$: Language Modeling with Explicit Memory

The authors formalize explicit memory by proposing $\text{Memory}^3$, a model that combines three memory formats: implicit memory (model parameters), working memory (context key-values), and explicit memory.

The explicit memories can be seen as retrievable model parameters, externalized knowledge, or sparsely-activated neural circuits.
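To make the three-tier picture concrete, here is a minimal sketch (my own toy code, not the paper's implementation; all names such as `ToyMemoryAttention` and `explicit_kv` are illustrative) of how retrieved explicit memories could be read by attention: implicit memory lives in the trainable projection weights, working memory is the key-value cache of the current context, and explicit memory is a store of precomputed key/value tensors that are retrieved and concatenated in front of the context keys and values.

```python
# A minimal sketch (not the paper's implementation) of the three memory
# formats: implicit memory = trainable weights, working memory = context
# key/value cache, explicit memory = precomputed, retrievable key/value
# tensors attended to alongside the context.
import torch
import torch.nn.functional as F


class ToyMemoryAttention(torch.nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Implicit memory: knowledge baked into trainable parameters.
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)

    def forward(self, x, explicit_kv=None):
        # Working memory: keys/values computed from the current context.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        if explicit_kv is not None:
            # Explicit memory: retrieved, precomputed key/value tensors are
            # concatenated in front of the context keys/values, so attention
            # can read them without re-encoding the reference text.
            mem_k, mem_v = explicit_kv
            k = torch.cat([mem_k, k], dim=1)
            v = torch.cat([mem_v, v], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v


# Usage: pass precomputed (key, value) tensors for a retrieved reference
# chunk alongside the current context.
d_model = 64
layer = ToyMemoryAttention(d_model)
context = torch.randn(1, 10, d_model)                     # working memory
explicit = (torch.randn(1, 5, d_model),
            torch.randn(1, 5, d_model))                    # explicit memory
output = layer(context, explicit_kv=explicit)              # (1, 10, 64)
```

The point of the sketch is that reading an explicit memory costs roughly as much as attending to a few extra tokens, rather than re-encoding the reference text as plain context.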

The explicit memory mechanism is introduced to alleviate the issue of knowledge traversal. Knowledge traversal happens when the LLM wastefully invokes all of its parameters (and thus all of its knowledge) each time it generates a token. As an analogy, it would be unreasonable for humans to recall everything they have ever learned whenever they write a single word.

The combined cost of LLM training and inference can be seen as the cost of encoding the knowledge from text data into various memory formats, plus the cost of reading from these memories during inference:

\[\sum_{\text{knowledge } k} \min_{\text{format } m} \Big( \text{cost}_{\text{write}}(k, m) + n_k \cdot \text{cost}_{\text{read}}(k, m) \Big)\]

where $n_k$ is the expected number of times knowledge $k$ is read during inference.
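To see what this trade-off implies, here is a toy calculation (the cost numbers are my own invented placeholders, not figures from the paper) that picks, for each knowledge item, the format minimizing $\text{cost}_{\text{write}} + n_k \cdot \text{cost}_{\text{read}}$.

```python
# A toy illustration of the cost trade-off above (all numbers are invented
# placeholders): writing into model parameters is expensive (training) but
# reading is nearly free; plain text (RAG) is free to write but costly to
# read at inference time; explicit memory sits in between on both axes.
FORMATS = {
    # format: (cost_write, cost_read)
    "model parameters": (100.0, 0.1),
    "explicit memory": (10.0, 1.0),
    "plain text (RAG)": (0.0, 5.0),
}


def best_format(n_k: float) -> tuple[str, float]:
    """Return the format minimizing cost_write + n_k * cost_read."""
    return min(
        ((fmt, w + n_k * r) for fmt, (w, r) in FORMATS.items()),
        key=lambda item: item[1],
    )


for n_k in (1, 5, 30, 1000):
    fmt, total = best_format(n_k)
    print(f"n_k = {n_k:4d} -> {fmt} (total cost {total:.1f})")
```

With these illustrative numbers, rarely used knowledge stays in plain text (RAG), very frequently used knowledge is worth baking into the model parameters, and explicit memory wins in the middle range, which is the paper's argument for a middle tier.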

Knowledge Shape

It is worth noting that knowledge can be encoded in a spectrum of formats:

plain text (RAG) → explicit memory → model parameter

LLM training consumes an enormous amount of data and energy [121, 77]. The authors want to rescue LLMs from this poor condition by equipping them with an explicit memory mechanism as efficient as that of humans.

Stop Reading

While reading this paper, I realized that the definitions and theory in this work are ill-defined, in that the notions of knowledge and of the associated subgraphs carry little concrete meaning.

For me, the readability of this work was not enough to keep going, so I stopped reading here.