Literature reviews of the oral and spotlight papers from the ICML 2024 Mechanistic Interpretability workshop.
Information in the world clusters into patterns, and interpreting those patterns is one way to verify neural networks. Mostly this interpretation is human-oriented. At the same time, interpretation can reveal the mechanisms inside a network, which in turn enables reliability checks and model editing. To do so, understanding the representation space of a trained model, such as a GPT, is crucial.
This work proposes a new metric to evaluate the features learned by sparse autoencoders (SAEs). Coverage is the proportion of a predefined feature set that correlates with some learned SAE feature. The work leverages known features in chess to measure coverage. In addition, the authors propose an annealing strategy as an alternative to a fixed $L_1$ penalty: the magnitude of the regularization coefficient $\lambda$ is annealed over training.
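A minimal sketch of how such an annealing schedule could plug into the SAE objective; the linear schedule shape, endpoint values, and function names below are illustrative assumptions, not the authors' exact recipe:

```python
import numpy as np

def annealed_lambda(step: int, total_steps: int,
                    lambda_init: float = 1e-3,
                    lambda_final: float = 1e-4) -> float:
    """Hypothetical linear annealing of the L1 coefficient lambda.

    The paper anneals the magnitude of the regularization term; the
    linear shape and the endpoint values here are assumptions.
    """
    frac = min(step / total_steps, 1.0)
    return lambda_init + frac * (lambda_final - lambda_init)

def sae_loss(x, x_hat, features, step, total_steps):
    """Standard SAE objective: reconstruction MSE + annealed L1 sparsity."""
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.sum(np.abs(features))
    return mse + annealed_lambda(step, total_steps) * l1
```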
This work explores the features of GPT models trained on the board games chess and Othello. There are several ways to encode board information for a GPT model, such as encoding the full board state at every time step; another is to provide only the executed moves of the game.
Previous work (the emergent world representations result on Othello-GPT) showed that board-state information is encoded in the GPT's hidden representations, by training a probe on them. In the same spirit, the authors in this work feed the model move information in Portable Game Notation (PGN) (e4 e5, …).
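As a rough, self-contained sketch of such a probing setup (the activations and labels below are random stand-ins; a real probe is fit on hidden states extracted from the game GPT):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: hidden states from the GPT at some layer (n_positions x d_model)
# and the true occupancy class of one board square at each position.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 512))   # placeholder activations
square_labels = rng.integers(0, 3, size=1000)  # e.g. empty / mine / yours

# One linear probe per square; a single square is shown for illustration.
probe = LogisticRegression(max_iter=1000).fit(hidden_states, square_labels)
print("probe accuracy:", probe.score(hidden_states, square_labels))
```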
The binary predictor $\phi$ of feature $f_i$ for input $x$ is
\[\phi_{f_i, t}(x) = \mathbb{I}[f_i(x) > t \cdot f_i^{\max}]\]
where $f_i^{\max}$ is the maximum value of the feature over the dataset and $t$ controls the threshold.
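In code, this thresholded predictor is essentially a one-liner (a sketch; the example activations are made up):

```python
import numpy as np

def feature_predictor(f_values: np.ndarray, f_max: float, t: float) -> np.ndarray:
    """Binary predictor phi_{f_i,t}: fires when the SAE feature activation
    exceeds a fraction t of its maximum activation over the dataset."""
    return (f_values > t * f_max).astype(int)

# Example: activations of one SAE feature over a batch of inputs.
acts = np.array([0.0, 0.2, 0.9, 1.4, 0.05])
phi = feature_predictor(acts, f_max=acts.max(), t=0.5)  # threshold at 0.7
print(phi)  # [0 0 1 1 0]
```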
Coverage measures, for every ground-truth feature $g \in \mathcal{F}$, how well the best predictor $\phi$ matches $g$: if some SAE feature $f_i$ correlates with $g$ (its predictor achieves a high $F_1$-score), then $g$ counts as covered.
\[\mathrm{Cov}(\{ f_i \}, \mathcal{F}) = \frac{1}{|\mathcal{F}|} \sum_{g\in \mathcal{F}} \max_i F_1(\phi_{f_i, t}; g)\]
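A sketch of the coverage computation under a few assumptions of mine: binary ground-truth labels for each $g$, a small grid of thresholds $t$, and taking the best $F_1$ over all SAE features for each $g$:

```python
import numpy as np
from sklearn.metrics import f1_score

def coverage(sae_acts: np.ndarray, gt_features: np.ndarray,
             thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)) -> float:
    """Cov({f_i}, F): for each ground-truth feature g, take the best F1
    achieved by any thresholded SAE feature, then average over g.

    sae_acts:    (n_inputs, n_sae_features) feature activations f_i(x)
    gt_features: (n_inputs, n_gt_features) binary labels g(x)
    """
    f_max = sae_acts.max(axis=0)                   # f_i^max over the dataset
    best = np.zeros(gt_features.shape[1])
    for t in thresholds:
        preds = (sae_acts > t * f_max).astype(int)  # phi_{f_i,t}(x)
        for j in range(gt_features.shape[1]):
            scores = [f1_score(gt_features[:, j], preds[:, i], zero_division=0)
                      for i in range(sae_acts.shape[1])]
            best[j] = max(best[j], max(scores))
    return best.mean()
```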
Reconstruction (fidelity) measures the prediction accuracy of the board state (an 8×8 board plus the piece on each square). Consider a board state $\mathbf{b}$ and SAE features $f_i$. For each board property $g$, the prediction rule is
\[\mathcal{P}_g(\{f_i(x)\}) = \begin{cases} 1 & \text{if } \phi_{f_i, t}(x) = 1 \text{ for any } f_i \text{ that is high precision for } g \text{ on } \mathcal{D}_{train} \\ 0 & \text{otherwise} \end{cases}\]
This predicts the board state property by property. Finally, reconstruction is the average $F_1$-score over all board states in the test dataset $\mathcal{D}_{test}$.
\[\operatorname{Rec}_t(\{ f_i \}, \mathcal{D}_{test}) = \frac{1}{|\mathcal{D}_{test}|} \sum_{x \in \mathcal{D}_{test}} F_1\big(\{ \mathcal{P}_g(\{ f_i(x) \}) \}_{g \in \mathcal{F}}; \mathbf{b}\big)\]
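A sketch of this reconstruction metric under the same assumptions; the precision cutoff `min_prec` is a hypothetical knob for deciding which features count as high precision for a property $g$:

```python
import numpy as np
from sklearn.metrics import precision_score, f1_score

def high_precision_sets(train_acts, train_board, f_max, t=0.5, min_prec=0.95):
    """For each board property g, find SAE features whose thresholded
    predictor is high precision for g on D_train (min_prec is assumed)."""
    preds = (train_acts > t * f_max).astype(int)
    return [
        [i for i in range(train_acts.shape[1])
         if precision_score(train_board[:, g], preds[:, i], zero_division=0) >= min_prec]
        for g in range(train_board.shape[1])
    ]

def reconstruction(test_acts, test_board, f_max, feature_sets, t=0.5):
    """Rec_t: predict each board property as 1 if any of its high-precision
    features fires, then average F1 over the test set's board states."""
    preds = (test_acts > t * f_max).astype(int)
    board_pred = np.zeros_like(test_board)
    for g, idxs in enumerate(feature_sets):
        if idxs:
            board_pred[:, g] = preds[:, idxs].max(axis=1)  # "any f_i fires"
    return np.mean([f1_score(test_board[k], board_pred[k], zero_division=0)
                    for k in range(test_board.shape[0])])
```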
There are two novel findings in this work.

| Model | Chess Coverage | Chess Reconstruction | Othello Coverage | Othello Reconstruction |
|---|---|---|---|---|
| SAE: random GPT | 0.11 | 0.01 | 0.27 | 0.08 |
| SAE: trained GPT | 0.41 | 0.67 | 0.52 | 0.95 |
| Linear probe | 0.98 | 0.98 | 0.99 | 0.99 |
It is interesting that increasing $L_0$ raises coverage in the early phase but lowers it beyond $L_0 = 150$. There are two possible explanations for this phenomenon.