Representation Interpretation of Refusal Mechanism In Large Language Models

We apply mechanistic interpretability to the refusal mechanism by proposing Prompt and Location-based Refusal Analysis (PLoRA), which compares refusal feature vectors generated from specific prompts and at specific token locations.

Bumjin Park, Yeonjea Kim, Jinsil Lee, Youngju Joung and Jaesik Choi
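A minimal sketch of the kind of analysis the abstract describes, not the authors' implementation: refusal feature vectors are taken as the difference of mean activations between two prompt sets, indexed by layer and token location, and compared by cosine similarity. The shapes, the random tensors standing in for extracted hidden states, and the `compare` helper are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

n_layers, n_tokens, d_model = 12, 16, 256

# Stand-ins for hidden states extracted from an LLM
# (shape: prompts x layers x token positions x hidden size).
harmful = torch.randn(32, n_layers, n_tokens, d_model)
harmless = torch.randn(32, n_layers, n_tokens, d_model)

# One refusal feature vector per (layer, token location):
# difference of mean activations between the two prompt sets.
refusal_vectors = harmful.mean(0) - harmless.mean(0)  # (layers, tokens, d_model)

def compare(layer_a, pos_a, layer_b, pos_b):
    """Cosine similarity between refusal vectors at two locations."""
    return F.cosine_similarity(
        refusal_vectors[layer_a, pos_a],
        refusal_vectors[layer_b, pos_b],
        dim=0,
    ).item()

print(compare(5, -1, 11, -1))  # e.g., mid-layer vs. last-layer at the final token
```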


Memorizing Documents with Guidance in Large Language Models

We propose document-wise memories, which assign document-wise locations to neural memories during training. The proposed architecture maps document representations to memory entries and filters memory selection in the forward pass of LLMs.

Bumjin Park and Jaesik Choi
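A hedged sketch of the idea described above, assuming a key-value-style memory layer: a document representation is mapped to a gate over memory entries, which filters which memories are selected in the forward pass. The module name, dimensions, and gating choice are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DocumentWiseMemory(nn.Module):
    def __init__(self, d_model=256, n_memories=1024, d_doc=64):
        super().__init__()
        self.keys = nn.Linear(d_model, n_memories)    # memory keys
        self.values = nn.Linear(n_memories, d_model)  # memory values
        self.doc_gate = nn.Linear(d_doc, n_memories)  # maps document repr. to memory entries

    def forward(self, hidden, doc_repr):
        # hidden: (batch, seq, d_model), doc_repr: (batch, d_doc)
        scores = self.keys(hidden)                         # (batch, seq, n_memories)
        gate = torch.sigmoid(self.doc_gate(doc_repr))      # (batch, n_memories)
        filtered = scores.softmax(-1) * gate.unsqueeze(1)  # filter memory selection per document
        return self.values(filtered)                       # (batch, seq, d_model)

layer = DocumentWiseMemory()
out = layer(torch.randn(2, 10, 256), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 10, 256])
```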


Identifying the Source of Generation for Large Language Models

This work introduces token-level source identification in the decoding step, which maps token representations to reference documents. We propose a bi-gram source identifier that takes two successive token representations as input.

Bumjin Park and Jaesik Choi
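A minimal sketch of a bi-gram source identifier as described in the abstract: a small classifier over the concatenation of two successive token representations that predicts a source-document index. The MLP design, sizes, and class name are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class BigramSourceIdentifier(nn.Module):
    def __init__(self, d_model=256, n_documents=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, 512),  # concatenated bi-gram representation
            nn.ReLU(),
            nn.Linear(512, n_documents),  # one logit per candidate reference document
        )

    def forward(self, prev_repr, curr_repr):
        # prev_repr, curr_repr: (batch, d_model) for tokens t-1 and t
        return self.mlp(torch.cat([prev_repr, curr_repr], dim=-1))

identifier = BigramSourceIdentifier()
logits = identifier(torch.randn(4, 256), torch.randn(4, 256))
print(logits.argmax(dim=-1))  # predicted source-document index per token
```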
