Review of dictionary learning for mechanistic interpretability at ICML 2024

A literature review of oral and spotlight papers from the ICML 2024 Mechanistic Interpretability workshop.

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Information in the world clusters into patterns, and interpreting those patterns is one way to verify the behavior of neural networks. Most such interpretation is human-oriented; at the same time, it can expose the internal mechanisms of a network, which enables reliability checks and model editing. Doing so requires understanding the representation space of a trained model such as GPT.

This work proposes new metrics to evaluate the features learned by sparse autoencoders (SAEs). Coverage is the proportion of a predefined set of known features that are well matched by some SAE feature; the authors leverage known board-state features in chess to measure it. In addition, they propose an annealing strategy as an alternative to plain $L_1$ regularization, which anneals the magnitude of the regularization coefficient $\lambda$ during training.
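The exact annealing schedule is not given here, so the following is a minimal sketch under an assumed linear-decay schedule: hold $\lambda$ at its maximum for an initial phase, then decay it linearly to zero (the function name `l1_coeff` and all parameters are hypothetical).

```python
def l1_coeff(step, total_steps, lam_max=1e-3, anneal_frac=0.8):
    """Hypothetical annealing schedule for the L1 coefficient lambda:
    hold lam_max for the first (1 - anneal_frac) of training, then
    decay linearly to 0. The paper's actual schedule may differ."""
    start = int((1 - anneal_frac) * total_steps)
    if step < start:
        return lam_max
    # linear decay from lam_max at `start` down to 0 at `total_steps`
    return lam_max * (1 - (step - start) / (total_steps - start))
```

The returned value would multiply the $L_1$ penalty on SAE activations at each training step, so sparsity pressure relaxes as training progresses.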

1. GPT for board games

This work explores the features of GPT models trained on board games, namely chess and Othello. There are several ways to encode board information for a GPT model: one is to encode the full board state at every time step; another is to provide only the sequence of executed actions (moves).

Previous work (Emergent World Representations, i.e., Othello-GPT) showed that board-state information is encoded in the hidden representations of a GPT by training a probe. In the same spirit, the authors of this work use action information in Portable Game Notation (PGN) (e4 e5, …).
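As an illustration of feeding PGN move sequences to a GPT, here is a minimal character-level encoding sketch; the function `pgn_to_tokens` and its vocabulary are assumptions for illustration, not the paper's actual tokenizer.

```python
def pgn_to_tokens(pgn_moves):
    """Hypothetical character-level tokenizer for a PGN move string
    such as "1.e4 e5 2.Nf3". Each character in a small fixed
    vocabulary maps to an integer id."""
    vocab = sorted(set(" .0123456789BKNOQRabcdefghx+#=-"))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    # drop characters outside the vocabulary (e.g. newlines)
    return [stoi[ch] for ch in pgn_moves if ch in stoi]
```

A character-level scheme keeps the vocabulary tiny and lets the model see move structure directly, at the cost of longer sequences than a move-level tokenizer.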

2. Verification of Features

Notations

Binary predictor $\phi$ for feature $f_i$ on input $x$:

\[\phi_{f_i, t} (x) = \mathbb{I}[f_i(x) > t \cdot f_i^{max}]\]

where $f_i^{max}$ is the maximum value of the feature over the dataset and $t$ controls the threshold.
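The predictor above can be sketched directly from its definition; `feature_predictor` is an illustrative name, and activations are assumed to be a NumPy array of the feature's values over the dataset.

```python
import numpy as np

def feature_predictor(f_vals, t):
    """phi_{f_i, t}: fires (returns 1) exactly when the feature's
    activation exceeds a fraction t of its dataset-wide maximum."""
    f_max = f_vals.max()
    return (f_vals > t * f_max).astype(int)
```
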

Coverage

For each known feature $g\in \mathcal{F}$, coverage measures how well the best-matching SAE predictor $\phi_{f_i, t}$ reproduces $g$. If some SAE feature $f_i$ correlates strongly with $g$, then $g$ is covered.

\[\mathrm{Cov}(\{ f_i \}, \mathcal{F}) = \frac{1}{|\mathcal{F}|} \sum_{g\in \mathcal{F}} \max_i F_1(\phi_{f_i, t}; g)\]
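A self-contained sketch of the coverage computation, assuming SAE activations and binary ground-truth features are given as NumPy arrays (the helper names `f1` and `coverage` are my own):

```python
import numpy as np

def f1(pred, true):
    """Binary F1 score; returns 0 when there are no positives."""
    tp = np.sum((pred == 1) & (true == 1))
    fp = np.sum((pred == 1) & (true == 0))
    fn = np.sum((pred == 0) & (true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def coverage(sae_acts, known_feats, t=0.5):
    """For each known feature g, take the best F1 over all SAE-feature
    predictors phi_{f_i, t}, then average over g.
    sae_acts:    (n_samples, n_sae_features) activations
    known_feats: (n_samples, n_known_features) binary ground truth"""
    f_max = sae_acts.max(axis=0)
    preds = (sae_acts > t * f_max).astype(int)  # phi per SAE feature
    scores = [
        max(f1(preds[:, i], g) for i in range(preds.shape[1]))
        for g in known_feats.T
    ]
    return float(np.mean(scores))
```
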

Fidelity

Fidelity measures how accurately the board state (the 8x8 board plus piece identities for chess) can be predicted from the SAE features. Consider a board state $\mathbf{b}$ and SAE features $f_i$. For each board-state property $g$, the prediction rule is

\[\mathcal{P}_g(\{f_i(x)\}) = \begin{cases} 1 & \text{if } \phi_{f_i, t}(x) = 1 \text{ for any } f_i \text{ that is high precision for } g \text{ on } \mathcal{D}_{train} \\ 0 & \text{otherwise} \end{cases}\]

This predicts each component of the board state. Finally, reconstruction is the average $F_1$-score between predicted and true board states over the test dataset $\mathcal{D}_{test}$:

\[\operatorname{Rec}_t(\{ f_i \}, \mathcal{D}_{test}) = \frac{1}{|\mathcal{D}_{test}|} \sum_{x \in \mathcal{D}_{test}} F_1\big(\{\mathcal{P}_g(\{ f_i(x) \})\}_{g \in \mathcal{F}};\ \mathbf{b}(x)\big)\]
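The prediction rule and the reconstruction average can be sketched as follows, assuming the binary predictor outputs and the set of high-precision features (determined on the training set) are precomputed as NumPy arrays; all names here are illustrative.

```python
import numpy as np

def f1(pred, true):
    """Binary F1 score; returns 0 when there are no positives."""
    tp = np.sum((pred == 1) & (true == 1))
    fp = np.sum((pred == 1) & (true == 0))
    fn = np.sum((pred == 0) & (true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def reconstruction(preds_test, high_prec, board_test):
    """Rec_t: predict each board property g as the OR of the binary
    predictors phi_{f_i, t} that were high precision for g on D_train,
    then average per-input F1 against the true board over D_test.
    preds_test: (n_test, n_sae_features) binary phi_{f_i, t}(x)
    high_prec:  (n_sae_features, n_props) bool membership matrix
    board_test: (n_test, n_props) binary ground-truth board states"""
    scores = []
    for x_pred, b in zip(preds_test, board_test):
        # P_g = 1 iff any high-precision feature for g fires on x
        pred_board = (x_pred @ high_prec.astype(int) > 0).astype(int)
        scores.append(f1(pred_board, b))
    return float(np.mean(scores))
```
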

3. Progress of Dictionary Learning

There are two novel findings in this work.

  1. The proposed coverage metric provides a decent evaluation of learned features against the selected feature sets.
  2. The proposed annealing strategy yields better results on both fidelity (reconstruction) and coverage.

3.1 Coverage and Reconstruction Comparison

| Model | Chess Coverage | Chess Reconstruction | Othello Coverage | Othello Reconstruction |
| --- | --- | --- | --- | --- |
| SAE: random GPT | 0.11 | 0.01 | 0.27 | 0.08 |
| SAE: trained GPT | 0.41 | 0.67 | 0.52 | 0.95 |
| Linear probe | 0.98 | 0.98 | 0.99 | 0.99 |

3.2 Sparsity and Metrics

Interestingly, increasing $L_0$ raises coverage in the early phase but lowers it beyond roughly $L_0=150$. There are two possible explanations for this phenomenon:

  1. The metric has pitfalls (coverage relies on an implicit predictor, and the threshold $t$ may be heuristic).
  2. Increasing $L_0$ reduces the number of highly correlated features.