Curriculum of Interpretability Team
Contents to follow along while learning interpretability.
Curriculum
| Chapter | Section |
| --- | --- |
| Chapter 1. Neuron | Neurons in Neuroscience |
| | Neurons in Deep Learning |
| | Interpret Neurons in MLP |
| | Interpret Neurons in CNN |
| | Interpret Neurons in RNN |
| | Interpret Neurons in Transformer |
| | Probing Neurons |
| | Class Activation Vector |
| ??? | Logit Lens |
| CNN | Channels |
| CNN | Feature Detector |
| CNN | Circuits |
| Transformer | Tokens |
| Transformer | Blocks |
| Transformer | Attention |
| Transformer | MLP |
| Transformer | Key-Value Memory |
🪴 Neuron Interpretation
- Linear Probing: train a linear classifier on frozen activations to test whether a neuron (or layer) encodes a concept (sketched after this list).
- Non-Linear Probing: probing with a non-linear classifier instead of a linear one.
- CAV (Concept Activation Vector): probing with concept labels; the probe's weight vector gives a direction in activation space that represents the concept.
- Logit Lens: directly map intermediate activations onto the logits through the unembedding matrix (sketched after this list).
- Feature Visualization (Deep Dream): gradient ascent on the input to maximize a neuron's activation (sketched after this list).
- Activation Atlas: apply UMAP to the activation vectors and place a feature visualization at each point.
- What-When-Where visualization: CAV test accuracy plotted for every layer, training epoch, and concept.
- Network Dissection: match neurons to concepts by overlapping activation heatmaps with labeled concept masks.
- CCD (Concept Confidence Deviation)
- Latent Saliency Maps
- Integrated Gradients: attribute a prediction to input features by integrating gradients along a path from a baseline to the input (sketched after this list).
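As a concrete illustration of linear probing (and of how a CAV falls out of it), here is a minimal sketch; the activations and concept labels are synthetic stand-ins for activations you would extract from a real model.

```python
# Minimal linear-probing sketch. The activations and concept labels below are
# synthetic stand-ins for activations extracted from a real model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 64                                  # activation dimensionality
labels = rng.integers(0, 2, size=1000)  # binary concept label per input
concept_direction = rng.normal(size=d)  # plant the concept along one direction
activations = rng.normal(size=(1000, d)) + labels[:, None] * concept_direction

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# A linear probe is just a linear classifier trained on frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe test accuracy: {probe.score(X_test, y_test):.3f}")

# High test accuracy suggests the concept is linearly decodable here; the
# normalized weight vector is a concept direction, as in CAV.
cav = probe.coef_[0] / np.linalg.norm(probe.coef_)
```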
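A minimal logit-lens sketch, assuming GPT-2 via Hugging Face `transformers`: each layer's hidden state is pushed through the model's own final layer norm and unembedding matrix to see which token the model would predict at that depth.

```python
# Logit-lens sketch for GPT-2: project every layer's hidden state through the
# final LayerNorm and the unembedding (lm_head) to read off an "early" guess.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors of shape (1, seq_len, d_model)
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    top = logits.argmax(-1)
    print(f"layer {layer:2d}: {tokenizer.decode(top)!r}")
```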
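A minimal feature-visualization sketch: gradient ascent on the input to excite one channel of a small, randomly initialized CNN (a stand-in for a trained model). Real feature visualization adds regularizers such as jitter and blur to obtain recognizable images.

```python
# Feature-visualization sketch: optimize the *input* so that one channel of a
# toy CNN fires strongly. The network weights stay fixed.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(),
)
model.eval()

x = torch.randn(1, 3, 64, 64, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([x], lr=0.05)
channel = 7  # the "neuron" (channel) we want to excite

for step in range(200):
    optimizer.zero_grad()
    loss = -model(x)[0, channel].mean()  # ascend: maximize mean activation
    loss.backward()
    optimizer.step()

print(f"final mean activation of channel {channel}:",
      model(x)[0, channel].mean().item())
```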
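And a minimal integrated-gradients sketch, approximating the path integral with a Riemann sum; the tiny linear model is a stand-in for any differentiable network.

```python
# Integrated-gradients sketch: average gradients along the straight path from
# a baseline (zeros) to the input, then scale by (input - baseline).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
x = torch.tensor([[1.0, 2.0, -1.0, 0.5]])
baseline = torch.zeros_like(x)
steps = 50

grads = []
for alpha in torch.linspace(0, 1, steps):
    point = baseline + alpha * (x - baseline)
    point.requires_grad_(True)
    model(point).sum().backward()
    grads.append(point.grad)

avg_grad = torch.stack(grads).mean(dim=0)
ig = (x - baseline) * avg_grad  # attribution per input feature
print("attributions:", ig)
# Completeness check: attributions should sum to f(x) - f(baseline).
print("sum:", ig.sum().item(), " f(x)-f(b):", (model(x) - model(baseline)).item())
```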
🪴 Representation Interpretation
- NMF (Non-negative Matrix Factorization): factorize non-negative activations into a chosen number of components (sketched after this list).
- Activation Intervention (Activation Patching): replace an activation at a chosen site with one cached from another run and measure the effect on the output (sketched after this list).
- t-SNE: non-linear dimensionality reduction for visualizing activations in 2D (sketched after this list).
- UMAP: like t-SNE, a non-linear dimensionality-reduction method; typically faster and often said to preserve more global structure.
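A minimal NMF sketch using scikit-learn; the random matrix stands in for non-negative activations (e.g. post-ReLU channels over spatial positions).

```python
# NMF sketch: factor a non-negative activation matrix into a small number of
# components, each of which can then be inspected or visualized.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
activations = rng.random((100, 512))  # 100 positions x 512 channels, all >= 0

nmf = NMF(n_components=6, init="nndsvd", random_state=0, max_iter=500)
weights = nmf.fit_transform(activations)  # (100, 6): component strength per position
components = nmf.components_              # (6, 512): each component over channels
print(weights.shape, components.shape)
```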
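A minimal activation-patching sketch on a toy MLP (a stand-in for a real model): cache a layer's activation on a "clean" run, then overwrite the same site on a "corrupted" run and see how much of the clean output is restored.

```python
# Activation-patching sketch using PyTorch forward hooks.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
model.eval()
site = model[1]  # patch the post-ReLU activation

clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)
cache = {}

# 1) Clean run: a forward hook caches the activation at the chosen site.
handle = site.register_forward_hook(lambda m, inp, out: cache.update(act=out.detach()))
with torch.no_grad():
    clean_out = model(clean_x)
handle.remove()

# 2) Corrupted run, unpatched baseline.
with torch.no_grad():
    corrupt_out = model(corrupt_x)

# 3) Corrupted run with the clean activation patched in. Returning a tensor
#    from a forward hook replaces the module's output.
handle = site.register_forward_hook(lambda m, inp, out: cache["act"])
with torch.no_grad():
    patched_out = model(corrupt_x)
handle.remove()

print("clean:    ", clean_out)
print("corrupted:", corrupt_out)
print("patched:  ", patched_out)  # moves toward clean_out if the site matters
```

In this toy net the patch fully restores the clean output, since everything downstream depends only on the patched site; in a real model the degree of restoration is the measurement of interest.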
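And a minimal t-SNE sketch; the random activations are a stand-in for real ones, and in practice each point would be colored by its class or concept label.

```python
# t-SNE sketch: embed high-dimensional activations into 2D for visualization.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
activations = rng.normal(size=(300, 128))  # 300 inputs x 128-dim activations

embedding = TSNE(n_components=2, perplexity=30.0,
                 random_state=0).fit_transform(activations)
print(embedding.shape)  # (300, 2) -> ready to scatter-plot
```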
🪴 Architecture Interpretation
- Transformer
  - Attention
    - Self-Attention
    - Cross-Attention
- CNN
- RNN
  - Gated RNN
- MLP
  - Linear Layer
- Pooling
Concepts
| Concept | Summary | Related Papers |
| --- | --- | --- |
| Grokking | The phenomenon where generalization arrives much later than the improvement in training performance. | – |
| Surface Statistics | The model relies on spurious correlations: a long list of correlations that do not reflect a causal model of the process generating the sequence. | |