Curriculum of Interpretability Team

Contents to follow when learning interpretability.

Curriculum

| Chapter | Section |
| --- | --- |
| Chapter 1. Neuron | Neurons in Neuroscience |
| | Neurons in Deep Learning |
| | Interpret Neurons in MLP |
| | Interpret Neurons in CNN |
| | Interpret Neurons in RNN |
| | Interpret Neurons in Transformer |
| | Probing Neurons |
| | Class Activation Vector |
| ??? | Logit Lens |
| | CNN Channels |
| | CNN Feature Detector |
| | CNN Circuits |
| | Transformer Tokens |
| | Transformer Blocks |
| | Transformer Attention |
| | Transformer MLP |
| | Transformer Key-Value Memory |

Tools

🪴 Neuron Interpretation

  • Linear Probing: determine whether activations encode a concept by training a linear classifier on them (see the probing sketch after this list).
  • Non-Linear Probing: probing with a non-linear classifier instead of a linear one.
  • CAV (Concept Activation Vector): probe activations with concept labels to obtain a direction in activation space representing the concept.
  • Logit Lens: directly map intermediate activations to the output logits through the unembedding (see the sketch after this list).
  • Feature Visualization (Deep Dream): gradient ascent on the input to maximize a chosen neuron's activation (see the sketch after this list).
  • Activation Atlas: apply UMAP to activation vectors and place a feature visualization at each point.
  • What-When-Where visualization: CAV test accuracy across layers, training epochs, and concepts.
  • Network Dissection: find concept-selective neurons by measuring the overlap between activation heatmaps and concept annotations.
  • CCD (Concept Confidence Deviation):
  • Latent Saliency Maps:
  • Integrated Gradients: attribute a prediction to input features by averaging gradients along a path from a baseline to the input (see the sketch after this list).
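
To make the probing entries above concrete, here is a minimal linear-probing sketch. The `activations` and `labels` arrays are random stand-ins for real layer activations and concept annotations, and scikit-learn's `LogisticRegression` stands in for any linear classifier.

```python
# Minimal linear-probing sketch (assumes activations and concept labels exist).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: one activation vector per example, plus a binary concept label.
activations = np.random.randn(1000, 768)        # placeholder for real layer activations
labels = np.random.randint(0, 2, size=1000)     # placeholder for concept annotations

X_train, X_test, y_train, y_test = train_test_split(activations, labels, test_size=0.2)

# A linear classifier on frozen activations; high test accuracy suggests the layer
# linearly encodes the concept. The learned weight vector can also be read as a
# concept direction, which is the idea behind CAVs.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```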
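
The Logit Lens can be sketched with Hugging Face `transformers`. Using `gpt2` and its module names (`transformer.ln_f`, `lm_head`) is an assumption about the model; other architectures expose these modules under different names.

```python
# Logit-lens sketch: project each layer's hidden state through the final
# layer norm and unembedding to see what the model "would predict" mid-stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

inputs = tok("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

for i, h in enumerate(out.hidden_states):              # one hidden state per layer (plus embeddings)
    logits = model.lm_head(model.transformer.ln_f(h))  # GPT-2-specific module names
    top = logits[0, -1].argmax()
    print(f"layer {i:2d}: {tok.decode([int(top)])!r}")
```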
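
Feature Visualization reduces to gradient ascent on the input. The sketch below assumes a torchvision ResNet-18; the chosen layer (`layer3`) and channel index are purely illustrative.

```python
# Feature-visualization sketch: gradient ascent on an input image to maximize
# the mean activation of one channel in an intermediate layer.
import torch
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()

captured = {}
def hook(module, inputs, output):
    captured["act"] = output
model.layer3.register_forward_hook(hook)       # illustrative choice of layer

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

channel = 42                                   # illustrative channel index
for step in range(200):
    opt.zero_grad()
    model(img)
    loss = -captured["act"][0, channel].mean() # ascend the channel's mean activation
    loss.backward()
    opt.step()
```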
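
Integrated Gradients can be written directly from its definition: average the gradients along a straight path from a baseline to the input, then scale by the input difference. This sketch assumes `model` maps a `(batch, features)` tensor to class logits; the baseline and step count are placeholders.

```python
# Integrated-gradients sketch: approximate the path integral of gradients
# from a baseline to the input with a Riemann sum.
import torch

def integrated_gradients(model, x, baseline=None, target=0, steps=50):
    """Attributions for a single input x of shape (1, num_features)."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * (x.dim() - 1)))
    # Interpolated inputs along the straight-line path from baseline to x.
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    model(path)[:, target].sum().backward()
    avg_grad = path.grad.mean(dim=0, keepdim=True)   # average gradient along the path
    return (x - baseline) * avg_grad                 # one attribution per input feature
```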

🪴 Representation Interpretation

  • NMF (Non-negative Matrix Factorization): factor the activation matrix into a chosen number of non-negative components (see the sketch after this list).
  • Activation Intervention (Activation Patching): replace an activation from one forward pass with the corresponding activation from another to measure its causal effect (see the sketch after this list).
  • t-SNE: non-linear dimensionality reduction for visualizing activations in two or three dimensions.
  • UMAP: a non-linear dimensionality-reduction method similar to t-SNE, also commonly used to embed activations.
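
A minimal sketch of activation patching with PyTorch forward hooks, assuming `model` is any `nn.Module` and `layer` is the sub-module whose output you want to patch; the clean and corrupted inputs must produce activations of the same shape.

```python
# Activation-patching sketch: replace a layer's output on one input
# with the cached output from another input.
import torch

def run_with_patch(model, layer, clean_input, corrupted_input):
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["act"]                     # overwrite with the clean activation

    h = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_input)                      # clean run: cache the activation
    h.remove()

    h = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(corrupted_input)    # corrupted run with the clean activation patched in
    h.remove()
    return patched_out
```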
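
The dimensionality-reduction entries above all start from the same activation matrix. Below is a sketch with scikit-learn's `NMF` and `TSNE` (UMAP itself lives in the separate `umap-learn` package); the activation matrix is a random non-negative stand-in.

```python
# Representation-reduction sketch: factor and embed an activation matrix.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.manifold import TSNE

activations = np.abs(np.random.randn(500, 512))   # stand-in; NMF needs non-negative input

# NMF: express each activation vector as a non-negative mix of a few components.
nmf = NMF(n_components=6, init="nndsvd", max_iter=500)
weights = nmf.fit_transform(activations)          # (500, 6) per-example component weights
components = nmf.components_                      # (6, 512) component directions

# t-SNE: non-linear 2-D embedding of the same activations for visualization.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(activations)
print(weights.shape, components.shape, embedding.shape)
```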

🪴 Architecture Interpretation

  • Transformer
  • Attention
  • Self-Attention
  • Cross-Attention
  • CNN
  • RNN
  • Gated RNN
  • MLP
  • Linear Layer
  • Pooling
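
As a reference for the Attention entries above, here is a minimal single-head scaled dot-product self-attention sketch; the dimensions and random weights are illustrative.

```python
# Single-head scaled dot-product self-attention sketch.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # project inputs to queries/keys/values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot products
    weights = F.softmax(scores, dim=-1)                    # attention pattern over positions
    return weights @ v, weights

x = torch.randn(10, 64)                                    # 10 tokens, model dim 64
w_q, w_k, w_v = (torch.randn(64, 32) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)                               # (10, 32), (10, 10)
```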

Concepts

| Concept | Summary | Related Papers |
| --- | --- | --- |
| Grokking | The phenomenon that generalization emerges much later than training performance. | – |
| Surface Statistics | The model relies on spurious correlations: a long list of correlations that do not reflect a causal model of the process generating the sequence. | |