Simset Early Experiments and main experiments design

심셋을 위한 실험을 모두 설계하다.

Does Structural Document Memory Help Factual Recall?

🗓️ Experiment Date : 24.04.21
🌸 Scope : Memorizing Documents for Simset (Counterfact)
🧑🏻‍💻 Code:Simset.24.04.22

현재까지 내가 제대로 하지 못하는 것은 large corpus에 대해서 simsest을 동작하게 만드는 것이다.

해당 실험은 inductive location에 대한 초기 실험 결과이다. 나는 이 실험을 통해서 특정 위치를 강제하는 것이 Simset에 대해서 적합한 적용 방법이라는 점을 알게 되었다.

Visiting Counterfact Dataset

Counterfactual 데이터를 다뤄보며 Factuality Editing에서 실험을 어떻게 하였는지 확인하였습니다.

subject, relation, object에 대해서 관계를 나오도록 모델링 하여 지식을 수정하는 기법은 (s,r) 다음에 나오는 o를 잘 맞추도록 수정한다.
그러나 MEMIT, FactualGPT, FactualMamba등 논문에서 평가하는 방식은 주어진 (s,r,o) pair를 수정했을 때 제대로 바뀌는 지, paraphrasing을 해도 유지 되는지 수준으로 평가한다. 주어진 subject와 relation에 대해서 강하게 object를 생성하도록 한다.

이 과정은 (s,r)이외에 다른 많은 관련된 것들이 포한된 경우나, (s,r) 관계가 문맥에 표현된 경우에 일반화되기 어렵다.

---------------
Prompt: 2011 Cannes Film Festival can be found in
Target word: Prescott
Generated answers: ['the following programs:\n', 'the following articles:\n', 'the following programs:\n', 'the following articles:\n', 'the following places:\n']
---------------
Prompt: Google Patents, a product developed by
Target word: Microsoft
Generated answers: ['Google, allows users to', 'Google, allows users to', 'Google, is a pat', 'Google, allows users to', 'Google LLC, is']
---------------
Prompt: Institut Polaire originated in
Target word: Budapest
Generated answers: ['1992', '1959', '1998', '1992', '1997']

지식의 유무에 대해서 확인하려면, 사람에게 물어도 당연한 Q-A 페어를 말하도록 추가적인 학습이 필요하다.

Simset 추가 시험 디자인

메모리 형성에 대해서 : for several configurations
메모리 활용에 대해서 (inductive loss) : for several model types

Simset 관련

실험의 파이프라인은 다음과 같은 구조이다.

데이터를 선택한다. (Wikipedia)
문서를 정의한다. (grouping)
문서에 대한 표현을 정의한다. (shingling numbers: 1 and 2)
Simset을 구동한다. (generating memories)
문서에 대한 메모리를 만든다.
만들어진 메모리를 분석한다.

Inductive Loss 관련

어떤 모델에 대해서 할 것인가? : GPT 형태, Llama2 형태 사이즈 2개씩 (Pythia)
어떤 레이어에 대해서 할 것인가? : 15% 간격으로 6개
얼마나 inductive alpha를 줄 것인가? (0 ~ 0.2 간격으로 6개 + 2.0 5.0)
메모리 할당 방식은 무엇으로 할 것인가? (random, partitioned, simset1, simset2, simset3)
문서 개수는 몇 개로 할 것인가? (2, 10, 50, 100, 1000)
- 전체 학습 개수 = 2 * 6 * 8 * 4 * 5 = 1920
- 연구 파라미터 수 줄이기:
  - 10% 간격으로 적절한 레이어 찾기 : 그냥 alpha=0.0 실험

selected_docs=1000
num_clusters=1000
hook_layer=24
lm_size=7b
memory_allocation=simset
mem_seed=0
alpha=0.1

save_dir=outputs/trains/lm"_"$lm_size/layer_$hook_layer/$num_clusters/$memory_allocation'_'$mem_seed/alpha_$alpha

To find the effective layer for memorizing documents with the proposed structure, we train the model to memorize 1K documents for two models. This training is equivalent to the proposed loss with $\alpha=0.0$. To evaluate the proposed method, we train the with different $\alpha$ and memory allocation methods. Currently we report the 1K results. For the comparison with the other number of documents, we train with the same layer and change the $alpha$ and memory structure.

Training Params

fixed number of epochs used for Llama2