Curriculum of Interpretability Team

Contents to follow when learning interpretability.

Curriculum

| Chapter | Section |
| --- | --- |
| Chapter 1. Neuron | Neurons in Neuroscience |
| | Neurons in Deep Learning |
| | Interpret Neurons in MLP |
| | Interpret Neurons in CNN |
| | Interpret Neurons in RNN |
| | Interpret Neurons in Transformer |
| | Probing Neurons |
| | Class Activation Vector |
| ??? | Logit Lens |
| | CNN Channels |
| | CNN Feature Detector |
| | CNN Circuits |
| | Transformer Tokens |
| | Transformer Blocks |
| | Transformer Attention |
| | Transformer MLP |
| | Transformer Key-Value Memory |

Tools

🪴 Neuron Interpretation

  • Linear Probing: determine whether activations encode a concept by training a linear classifier on them (see the probing sketch after this list).
  • Non-Linear Probing: probing with a non-linear classifier instead of a linear one.
  • CAV (Concept Activation Vector): probe activations with concept labels to obtain a direction in activation space representing the concept.
  • Logit Lens: directly map intermediate activations to the output logits through the unembedding (see the sketch after this list).
  • Feature Visualization (Deep Dream): gradient ascent on the input to maximize a chosen neuron's activation (see the sketch after this list).
  • Activation Atlas: apply UMAP to activation vectors and place a feature visualization at each point.
  • What-When-Where visualization: CAV test accuracy across layers, training epochs, and concepts.
  • Network Dissection: find concept-selective neurons by measuring the overlap between activation heatmaps and concept annotations.
  • CCD (Concept Confidence Deviation):
  • Latent Saliency Maps:
  • Integrated Gradients: attribute a prediction to input features by averaging gradients along a path from a baseline to the input (see the sketch after this list).
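
To make the probing entries above concrete, here is a minimal linear-probing sketch. The `activations` and `labels` arrays are random stand-ins for real layer activations and concept annotations, and scikit-learn's `LogisticRegression` stands in for any linear classifier.

```python
# Minimal linear-probing sketch (assumes activations and concept labels exist).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: one activation vector per example, plus a binary concept label.
activations = np.random.randn(1000, 768)        # placeholder for real layer activations
labels = np.random.randint(0, 2, size=1000)     # placeholder for concept annotations

X_train, X_test, y_train, y_test = train_test_split(activations, labels, test_size=0.2)

# A linear classifier on frozen activations; high test accuracy suggests the layer
# linearly encodes the concept. The learned weight vector can also be read as a
# concept direction, which is the idea behind CAVs.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```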
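
The Logit Lens can be sketched with Hugging Face `transformers`. Using `gpt2` and its module names (`transformer.ln_f`, `lm_head`) is an assumption about the model; other architectures expose these modules under different names.

```python
# Logit-lens sketch: project each layer's hidden state through the final
# layer norm and unembedding to see what the model "would predict" mid-stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

inputs = tok("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

for i, h in enumerate(out.hidden_states):              # one hidden state per layer (plus embeddings)
    logits = model.lm_head(model.transformer.ln_f(h))  # GPT-2-specific module names
    top = logits[0, -1].argmax()
    print(f"layer {i:2d}: {tok.decode([int(top)])!r}")
```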
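
Feature Visualization reduces to gradient ascent on the input. The sketch below assumes a torchvision ResNet-18; the chosen layer (`layer3`) and channel index are purely illustrative.

```python
# Feature-visualization sketch: gradient ascent on an input image to maximize
# the mean activation of one channel in an intermediate layer.
import torch
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()

captured = {}
def hook(module, inputs, output):
    captured["act"] = output
model.layer3.register_forward_hook(hook)       # illustrative choice of layer

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

channel = 42                                   # illustrative channel index
for step in range(200):
    opt.zero_grad()
    model(img)
    loss = -captured["act"][0, channel].mean() # ascend the channel's mean activation
    loss.backward()
    opt.step()
```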
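
Integrated Gradients can be written directly from its definition: average the gradients along a straight path from a baseline to the input, then scale by the input difference. This sketch assumes `model` maps a `(batch, features)` tensor to class logits; the baseline and step count are placeholders.

```python
# Integrated-gradients sketch: approximate the path integral of gradients
# from a baseline to the input with a Riemann sum.
import torch

def integrated_gradients(model, x, baseline=None, target=0, steps=50):
    """Attributions for a single input x of shape (1, num_features)."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * (x.dim() - 1)))
    # Interpolated inputs along the straight-line path from baseline to x.
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    model(path)[:, target].sum().backward()
    avg_grad = path.grad.mean(dim=0, keepdim=True)   # average gradient along the path
    return (x - baseline) * avg_grad                 # one attribution per input feature
```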

🪴 Representation Interpretation

  • NMF (Non-negative Matrix Factorization): factor the activation matrix into a chosen number of non-negative components (see the sketch after this list).
  • Activation Intervention (Activation Patching): replace an activation from one forward pass with the corresponding activation from another to measure its causal effect (see the sketch after this list).
  • t-SNE: non-linear dimensionality reduction for visualizing activations in two or three dimensions.
  • UMAP: a non-linear dimensionality-reduction method similar to t-SNE, also commonly used to embed activations.
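
A minimal sketch of activation patching with PyTorch forward hooks, assuming `model` is any `nn.Module` and `layer` is the sub-module whose output you want to patch; the clean and corrupted inputs must produce activations of the same shape.

```python
# Activation-patching sketch: replace a layer's output on one input
# with the cached output from another input.
import torch

def run_with_patch(model, layer, clean_input, corrupted_input):
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["act"]                     # overwrite with the clean activation

    h = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_input)                      # clean run: cache the activation
    h.remove()

    h = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(corrupted_input)    # corrupted run with the clean activation patched in
    h.remove()
    return patched_out
```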
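
The dimensionality-reduction entries above all start from the same activation matrix. Below is a sketch with scikit-learn's `NMF` and `TSNE` (UMAP itself lives in the separate `umap-learn` package); the activation matrix is a random non-negative stand-in.

```python
# Representation-reduction sketch: factor and embed an activation matrix.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.manifold import TSNE

activations = np.abs(np.random.randn(500, 512))   # stand-in; NMF needs non-negative input

# NMF: express each activation vector as a non-negative mix of a few components.
nmf = NMF(n_components=6, init="nndsvd", max_iter=500)
weights = nmf.fit_transform(activations)          # (500, 6) per-example component weights
components = nmf.components_                      # (6, 512) component directions

# t-SNE: non-linear 2-D embedding of the same activations for visualization.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(activations)
print(weights.shape, components.shape, embedding.shape)
```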

🪴 Architecture Interpretation

  • Transformer
  • Attention
  • Self-Attention
  • Cross-Attention
  • CNN
  • RNN
  • Gated RNN
  • MLP
  • Linear Layer
  • Pooling
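
As a reference for the Attention entries above, here is a minimal single-head scaled dot-product self-attention sketch; the dimensions and random weights are illustrative.

```python
# Single-head scaled dot-product self-attention sketch.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # project inputs to queries/keys/values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot products
    weights = F.softmax(scores, dim=-1)                    # attention pattern over positions
    return weights @ v, weights

x = torch.randn(10, 64)                                    # 10 tokens, model dim 64
w_q, w_k, w_v = (torch.randn(64, 32) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)                               # (10, 32), (10, 10)
```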

Concepts

| Concept | Summary | Related Papers |
| --- | --- | --- |
| Grokking | The phenomenon that generalization emerges much later than training performance. | – |
| Surface Statistics | The model relies on spurious correlations: a long list of correlations that do not reflect a causal model of the process generating the sequence. | |