Literature Review: Neural Memory, kNN Augmentation, and Other Recent Trends

This article reviews 12 papers related to neural memory and the use of data embeddings.

Reading List

| Date | Venue | Title |
| --- | --- | --- |
| 2023 | NeurIPS | Neural Priming for Sample-Efficient Adaptation |
| 2023 | NeurIPS | Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context |
| 2023 | NeurIPS | ResMem: Learn what you can and memorize the rest |
| 2023 | NeurIPS | Accessing Higher Dimensions for Unsupervised Word Translation |
| 2023 | NeurIPS | Lift Yourself Up: Retrieval-augmented Text Generation with Self-Memory |
| 2023 | NeurIPS | Exposing Attention Glitches with Flip-Flop Language Modeling |
| 2024 | arXiv | TransformerFAM: Feedback attention is working memory |
| 2024 | arXiv | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention |
| 2017 | ICLR | Learning to Remember Rare Events |
| 2019 | ICCV | Memorizing Normality to Detect Anomaly: Memory-augmented Deep Autoencoder (MemAE) for Unsupervised Anomaly Detection |
| 2020 | ICLR | Generalization through Memorization: Nearest Neighbor Language Models |
| 2023 | AAAI | Memory-Augmented Theory of Mind Network |

Overall thoughts on the papers above and future directions

  1. Rather than storing every data representation inside the model, keeping representations outside the model and retrieving them at use time is becoming the dominant approach.
    (ResMem, MemAE, Generalization through Memorization, Memory-Augmented Theory of Mind Network)
  2. Augmenting a Transformer is typically done by combining an independently computed distribution with the model's own logits (monitor-guided decoding, kNN interpolation, etc.).
  3. Within the Transformer family, trust in pure attention is weakening and recurrent (RNN-style) mechanisms are regaining ground (Attention Glitches, TransformerFAM, Infini-attention).

1. Neural Priming for Sample-Efficient Adaptation

Key method

  1. Group the data that share a given label to form a cluster.
  2. At prediction time, use the embedding vectors formed from that cluster.

This work proposes a Neural Priming pool that retrieves image descriptions similar to the test input from the pretraining data and uses them for prediction. The data used during pretraining thus supplies embeddings for later predictions. In effect, this applies the kNN-retrieval-based prediction used in machine translation to the pretraining data.
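
As a rough illustration of how such a priming pool could be used, here is a minimal sketch, assuming a CLIP-style encoder has already produced embeddings for the retrieved pretraining examples and for the test image; the centroid pooling and the mixing weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def build_priming_centroids(pool_embeddings, pool_labels, num_classes):
    """Average the retrieved pretraining embeddings per class (the 'priming pool')."""
    centroids = np.zeros((num_classes, pool_embeddings.shape[1]))
    for c in range(num_classes):
        members = pool_embeddings[pool_labels == c]
        if len(members) > 0:
            centroids[c] = members.mean(axis=0)
    # normalize so cosine similarity reduces to a dot product
    norms = np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-8
    return centroids / norms

def primed_predict(test_embedding, zero_shot_weights, centroids, alpha=0.5):
    """Blend zero-shot class scores with scores from the priming centroids.
    `alpha` is a hypothetical mixing weight, not a value from the paper."""
    x = test_embedding / (np.linalg.norm(test_embedding) + 1e-8)
    zero_shot_scores = zero_shot_weights @ x
    priming_scores = centroids @ x
    return alpha * priming_scores + (1 - alpha) * zero_shot_scores
```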

์ด๋Š” ๋ชจ๋ธ์ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ•™์Šต์œผ๋กœ ์ธํ•œ ํŽธํ–ฅ์„ ํ™œ์šฉํ•˜๋Š” ๋ฐ ๊ด€๋ จ์ด ์žˆ์–ด๋ณด์ธ๋‹ค.

(Similar to Giui's earlier point that input attribution differs between what the model assumes and the actual label; the connection here is the use of embeddings of the pretraining data.)

2. Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context

ํ•ด๋‹น ๋ฐฉ์‹์€ ์ฝ”๋“œ ์ƒ์„ฑ ์–ธ์–ด ๋ชจ๋ธ(Code generation LLM)์—์„œ ๋งˆ์ง€๋ง‰ ๋กœ์ง“์„ ๊ฐ€๋ฆฌ๋Š” ๋งˆ์Šคํฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ƒ์„ฑ์„ ์•ˆ๋‚ดํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค. ์ด๋Š” ์›Œํ„ฐ๋งˆํฌ์—์„œ ๋‹ค์Œ ๋‹จ์–ด์— ๋Œ€ํ•œ ๋ถ„ํฌ๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ์ถ”ํ›„ ์˜ˆ์ธก์— ์‚ฌ์šฉํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ํ•ด๋‹น ์—ฐ๊ตฌ๋„ ์œ ํšจํ•œ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ์ƒ˜ํ”Œ๋งํ•˜๋Š” ๋‹จ์–ด๋ฅผ ์ œ์•ฝํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

3. ResMem: Learn what you can and memorize the rest

After the base model is trained, an additional module memorizes the residuals. At prediction time, kNN search retrieves nearby samples and their residuals are used to correct the base prediction.
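
A minimal sketch of the idea, assuming a trained `base_predict` function and access to the training features and labels; the kNN residual correction below is an illustration rather than the paper's exact estimator.

```python
import numpy as np

def build_residual_memory(train_features, train_labels, base_predict):
    """Store, for every training point, what the base model failed to explain."""
    residuals = train_labels - base_predict(train_features)
    return train_features, residuals

def resmem_predict(x, memory, base_predict, k=5):
    keys, residuals = memory
    # nearest neighbours of x in feature space
    dists = np.linalg.norm(keys - x, axis=1)
    idx = np.argsort(dists)[:k]
    # base prediction corrected by the average memorized residual
    return base_predict(x[None])[0] + residuals[idx].mean(axis=0)
```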

4. Accessing Higher Dimensions for Unsupervised Word Translation

Word vectors are built directly from co-occurrence statistics, which turns out to enable effective unsupervised word translation (Coocmap).
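
A minimal sketch of the co-occurrence vectors this approach starts from, assuming each word is represented by its row of windowed co-occurrence counts; the actual Coocmap normalization and matching procedure is omitted.

```python
import numpy as np

def cooccurrence_vectors(corpus_tokens, vocab, window=5):
    """Represent each word by how often it co-occurs with every vocabulary word."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(corpus_tokens):
        if w not in index:
            continue
        lo, hi = max(0, i - window), min(len(corpus_tokens), i + window + 1)
        for j in range(lo, hi):
            c = corpus_tokens[j]
            if j != i and c in index:
                counts[index[w], index[c]] += 1
    # simple row normalization so vectors are comparable across word frequencies
    return counts / (counts.sum(axis=1, keepdims=True) + 1e-8)
```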

5. Lift Yourself Up: Retrieval-augmented Text Generation with Self-Memory

Proposes an interactive, self-play-style use of memory, in which the model's own outputs serve as retrieval memory for later generations.

๋” ์ข‹์€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋ฝ‘๋„๋ก ํ•™์Šตํ•œ๋‹ค. (Target distribution์— ๋Œ€ํ•ด์„œ KL divergence๋ฅผ ์ตœ์†Œํ™”) ๋” ์ข‹์€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜๋„๋ก ํ•™์Šตํ•œ๋‹ค.

6. Exposing Attention Glitches with Flip-Flop Language Modeling

๋ฌธ์ œ์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•œ ๊ฒƒ ๊ฐ™๋‹ค. ์†Œ์Šค์™€ ํƒ€๊ฒŸ์ด ์ฃผ์–ด์กŒ์„ ๋•Œ ๊ฐ ๋‹จ์–ด์˜ ๊ณต์ƒ(cooccurrence)์„ ์ธก์ •ํ•˜๋Š” ๊ฒƒ์€ ์ดํ•ด๋ฉ๋‹ˆ๋‹ค. ์„ ํƒ๋œ ๋‹จ์–ด๋“ค ๊ฐ„์˜ ์ตœ๋Œ€ํ•œ ์ผ์น˜(matching)๋ฅผ ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž…๋‹ˆ๋‹ค.

7. TransformerFAM: Feedback attention is working memory

๋ฉ”๋ชจ๋ผ์ด์ œ์ด์…˜์„ ์œ„ํ•œ ํ† ํฐ์„ ์‚ฌ์šฉํ•˜์—ฌ long-range token processing์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“ค์—ˆ๋‹ค. ๊ฐ’์„ ๋ช…์‹œ์ ์œผ๋กœ ์ „๋‹ฌํ•ด์ฃผ๋Š” ๊ฒƒ ๋ฟ๋งŒ์•„๋‹ˆ๋ผ ใ…ˆ๋†’์€ ๋ ˆ์ด์–ด์— ์žˆ๋Š” ๊ฐ’์„ ์•„๋ž˜๋กœ ๋‚ด๋ฆด ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— Transformer๊ธฐ๋ฐ˜ ๋ชจ๋ธ์ด ํŠน์ • ๋Œ€์ƒ์„ ๋”์šฑ ์˜ค๋žซ๋™์•ˆ ์ƒ๊ฐํ•˜๋„๋ก ๋งŒ๋“œ๋Š” ๊ฒƒ ๊ฐ™๋‹ค.

8. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

An RNN-style recurrent memory is inserted in the middle so that representations are carried across segments. The recent trend seems to be research that gives Transformer-based models a recurrent hidden state.
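
A minimal sketch of the per-segment compressive-memory update in the linear-attention style the paper describes (the memory accumulates $\sigma(K)^\top V$ and is read with $\sigma(Q)$); the learned gate that mixes this memory read with local attention is omitted.

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_memory_step(M, z, Q, K, V):
    """One segment of the compressive memory (simplified).
    M: (d_k, d_v) memory matrix, z: (d_k,) normalization term."""
    sQ, sK = elu_plus_one(Q), elu_plus_one(K)
    # read from memory with the current segment's queries
    A_mem = (sQ @ M) / ((sQ @ z) + 1e-8)[:, None]
    # then write the current segment into memory
    M = M + sK.T @ V
    z = z + sK.sum(axis=0)
    return A_mem, M, z
```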

9. Learning to Remember Rare Events

Recalling and exploiting rare events is not easy; in particular, neural networks struggle with long-range dependencies and long inputs. I need to study the memory-update mechanism further.
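
For my own reference, a minimal sketch of a key-value memory with an age counter, roughly following the module described in the paper: nearest-key lookup, the key nudged toward the query on a correct hit, and the oldest slot overwritten on a miss. The accompanying training loss is omitted.

```python
import numpy as np

class RareEventMemory:
    def __init__(self, size, dim):
        self.keys = np.random.randn(size, dim)
        self.keys /= np.linalg.norm(self.keys, axis=1, keepdims=True)
        self.values = -np.ones(size, dtype=int)   # stored labels, -1 = empty
        self.age = np.zeros(size)

    def query(self, q):
        """Return the value and index of the nearest stored key (cosine similarity)."""
        q = q / (np.linalg.norm(q) + 1e-8)
        idx = int(np.argmax(self.keys @ q))
        return self.values[idx], idx

    def update(self, q, label):
        q = q / (np.linalg.norm(q) + 1e-8)
        value, idx = self.query(q)
        self.age += 1
        if value == label:
            # correct hit: move the key toward the query and reset its age
            new_key = self.keys[idx] + q
            self.keys[idx] = new_key / (np.linalg.norm(new_key) + 1e-8)
            self.age[idx] = 0
        else:
            # miss: overwrite the oldest slot with this rare example
            oldest = int(np.argmax(self.age))
            self.keys[oldest], self.values[oldest], self.age[oldest] = q, label, 0
```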

10. Memorizing Normality to Detect Anomaly: Memory-augmented Deep Autoencoder (MemAE) for Unsupervised Anomaly Detection

ํ•ด๋‹น ๋ฐฉ์‹์€ Code generation LLM์— ๋Œ€ํ•ด์„œ ๋งˆ์ง€๋ง‰ Logit์„ ๊ฐ€๋ฆด mask๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ generation์„ guide ํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค watermark์—์„œ ๋‹ค์Œ ๋‹จ์–ด์— ๋Œ€ํ•œ ๋ถ„ํฌ๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ์ถ”ํ›„ ์˜ˆ์ธก์„ ์œ„ํ•ด์„œ ์‚ฌ์šฉํ–ˆ๋˜ ๊ฒƒ์ฒ˜๋Ÿผ ํ•ด๋‹น ์—ฐ๊ตฌ๋„ Valid code๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ƒ˜ํ”Œํ•˜๋Š” ๋‹จ์–ด๋ฅผ ์ œ์•ฝํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

Neural Priming for Sample-Efficient Adaptation

Pretraining ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ์œ ์‚ฌํ•œ ์ด๋ฏธ์ง€ description์„ ๊ฐ€์ ธ์™€์„œ ์˜ˆ์ธก์— ํ™œ์šฉํ•˜๋Š” neural priming pool์„ ์ œ์•ˆํ•˜์˜€๋‹ค. pretraining ๋‹น์‹œ ์‚ฌ์šฉํ–ˆ๋˜ ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” ์ดํ›„ ์˜ˆ์ธก์„ ํ•˜๋Š”๋ฐ ์žˆ์–ด์„œ ์ž„๋ฒ ๋”ฉ์„ ํ™œ์šฉํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ MT์—์„œ ํ™œ์šฉํ•˜๋Š” kNN search๊ธฐ๋ฐ˜ ์˜ˆ์ธก์— ๋Œ€ํ•ด์„œ pretraining ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค. ์ด๋Š” ๋ชจ๋ธ์ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ํ•™์Šตํ•˜๋ฉฐ ์ƒ๊ธด bias๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ๊ณผ๋„ ์—ฐ๊ด€๋˜์–ด ์žˆ๋‹ค. (์ด์ „์— ๊ธฐ์˜๋‹˜์ด ๋ชจ๋ธ์ด ์ƒ๊ฐํ•˜๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ ๋ ˆ์ด๋ธ”์— ๋Œ€ํ•ด์„œ input attribution์ด ๋‹ค๋ฅด๋‹ค๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, pretraining ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์ž„๋ฒ ๋”ฉ ํ™œ์šฉ)

  1. ํŠน์ • ๋ ˆ์ด๋ธ”์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌถ์–ด์„œ ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ํ˜•์„ฑํ•œ๋‹ค.
  2. ์˜ˆ์ธกํ•  ๋•Œ, ํ•ด๋‹น ํด๋Ÿฌ์Šคํ„ฐ๋กœ ํ˜•์„ฑ๋œ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๋ฅผ ํ™œ์šฉํ•œ๋‹ค.

For an autoencoder, an embedding (query) is produced and the reconstruction is formed from values selected out of a key-value memory. The proposed method enforces sparsity on the addressing weights over the keys by using hard shrinkage.

\[\hat{w}_i = h(w_i;\lambda) = \begin{cases} w_i & \text{if } w_i > \lambda \\ 0 & \text{otherwise} \end{cases}\]

์ด ์‹์€ ReLU์— ์˜ํ•ด์„œ ๋‹ค์‹œ ์“ฐ์—ฌ์งˆ ์ˆ˜ ์žˆ๋‹ค. \(\hat{w} = \frac{\text{max}(w_i - \lambda, 0 ) * w_i}{|w_i - \lambda| + \epsilon}\)

์ด ๋ฐฉ์‹์€ ๋ช…์‹œ์ ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ์˜ ์—”ํŠธ๋ฆฌ๋ฅผ ์„ ํƒํ•˜๊ฑฐ๋‚˜, ์„ ํƒํ•˜์ง€ ์•Š๋Š” ๋ฐฉ์‹์„ ์ œ๊ณตํ•œ๋‹ค. weight ์— ๋Œ€ํ•œ magnitude๋ฅผ ๊ฐ์†Œ์‹œํ‚ค๊ธฐ ๋ณด๋‹ค entropy๋ฅผ ๊ฐ์†Œ์‹œํ‚ค๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ง„ํ–‰๋„์žˆ๋‹ค.

11. Generalization Through Memorization: Nearest Neighbor Language Models

The representation at the final layer is used as a query for a kNN search, and the retrieved distribution is mixed with the LM's output probability. On Wikitext, weighting the kNN part more heavily ($\lambda = 0.5$) performed better, while on books, using the kNN part less and the LM output more ($\lambda = 0.2$) performed better. This is because Wikitext has a more extractive character.

๊ฐ ๋‹จ์–ด๋งˆ๋‹ค ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•ด์„œ ์—…๋ฐ์ดํŠธ ํ•ด์ค€๋‹ค. \(p_{KNN}(y|x) \propto \sum_{(k_i, v_i) \in \mathcal{N}} 1_{y=v_i} \exp{-d(k_i, f(x))}\)

The distribution obtained from the kNN search is then linearly interpolated with the LM's own distribution.

\[p(y|x) = \lambda p_{knn}(y|x) + (1-\lambda)p_{LM}(y|x)\]
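
A minimal sketch of this interpolation, assuming a datastore of (final-layer representation, next token) pairs built from the training corpus; the L2 distance and the neighbor count are assumptions for illustration.

```python
import numpy as np

def knn_lm_probs(context_repr, keys, next_tokens, p_lm, k=8, lam=0.25):
    """Interpolate the LM distribution with a distribution built from the
    k nearest datastore entries: p = lam * p_knn + (1 - lam) * p_lm."""
    dists = np.linalg.norm(keys - context_repr, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest])          # closer neighbours count more
    p_knn = np.zeros_like(p_lm)
    for idx, w in zip(nearest, weights):
        p_knn[next_tokens[idx]] += w
    p_knn /= (p_knn.sum() + 1e-12)
    return lam * p_knn + (1 - lam) * p_lm
```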

12. Memory-Augmented Theory of Mind Network

์ด ์—ฐ๊ตฌ์—์„œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ์‹์€ trajectory์— ๋Œ€ํ•ด์„œ ๊ณผ๊ฑฐ ์ •๋ณด๋ฅผ key-value๋กœ ์ €์žฅํ•˜๊ณ  ํ˜„์žฌ ์ƒํƒœ์—์„œ ์ •๋ณด๋ฅผ ๋ฝ‘์•„๋‚ด๋Š” ๋ถ€๋ถ„์„ ๊ตฌํ˜„ํ•˜์˜€๋‹ค.

  1. Given the current trajectory, M queries are generated from the current state based on cosine similarity over the trajectory ($q_m, m=1,2,\cdots, M$).
  2. Past information is stored as a key-value memory, and the memory is retrieved for each $q_m$:
\[\bar{v}_m = \sum_{v_j^t \in \mathcal{M}.value} \operatorname{attn}(q_m, k_j^t) v_j^t\]
  3. The values produced for each query are then mixed again with attention.

์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ trajectory๊ฐ€ ์—ฌ๋Ÿฌ state ๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ๊ธฐ์— hierarchicalํ•˜๊ฒŒ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋ฝ‘๋Š” ๊ตฌ์กฐ์ด๋‹ค. ์ด ๊ณผ์ •์—์„œ attention์œผ๋กœ ์ •๋ณด๋ฅผ ์„ž๋Š” ๊ฒƒ์€ ๋‘ ๋ฒˆ ๋‚˜ํƒ€๋‚œ๋‹ค. key-value๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ์‹์€ LSTM์˜ hidden state๋กœ๋ถ€ํ„ฐ forward ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•ด์„œ ์—ฐ์‚ฐํ•œ ๊ฒƒ์ด๋‹ค.