Literature reviews on ICML 2024 mechanistic interpretability workshop oral and spotlight papers.
Researchers in mechanistic interpretability (MI) aim to establish the causal effect of nodes in neural networks, and causal analysis plays a crucial role in this field. This paper criticizes misuses of causal interpretation in MI, drawing on traditional pitfalls of interpretation in causality.
The counterfactual theory of causality (Lewis, 1973) is a useful tool for analyzing causal relationships, but MI researchers often overlook the pitfalls of interpreting counterfactual analyses. The two problems are (1) the failure of transitivity and (2) the difficulty of attributing causality when multiple variables jointly determine an outcome.
Some classical definitions of causality are:
David Hume (1748):
“We may define a cause to be an object, followed by another … where, if the first object had not been, the second never had existed.”
Lewis (1973) posits that
An event $E$ causally depends on $C$ iff (1) if $C$ had occurred, then $E$ would have occurred, and (2) if $C$ had not occurred, then $E$ would not have occurred.
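A compact restatement using the standard counterfactual-conditional symbol (a shorthand assumed here, not notation taken from the reviewed papers): writing $C \mathbin{\square\!\!\rightarrow} E$ for “if $C$ had occurred, $E$ would have occurred,” causal dependence of $E$ on $C$ is

$$(C \mathbin{\square\!\!\rightarrow} E) \;\wedge\; (\neg C \mathbin{\square\!\!\rightarrow} \neg E).$$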
Note that this definition concerns dependence between a single pair of events; whether such dependence composes transitively is a separate question, which the following example examines.
Consider three events: $A$ (a stone rolls toward a hiker), $B$ (the hiker ducks), and $C$ (the hiker survives).
Here the sequence of events $A \rightarrow B \rightarrow C$ occurs, and there are clearly two causal relationships: $A \rightarrow B$ and $B \rightarrow C$.
However, there is no causal dependence between the rolling stone and the survival: had the stone not rolled, the hiker would not have ducked but would have survived anyway, because survival is determined through the mediator (the hiker's action).
Transitivity is often treated as a key property of actual causality, and there are many cases where it does hold, such as a chain of colliding billiard balls.
In such cases, $A \rightarrow C$ holds whenever $A \rightarrow B$ and $B \rightarrow C$ hold.
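Returning to the stone-and-hiker example, here is a minimal sketch (assuming deterministic mechanisms; the functions `duck` and `survive` are hypothetical) of how Lewis-style counterfactual dependence can be checked mechanically, and of how it holds for $A \rightarrow B$ and $B \rightarrow C$ but fails for $A \rightarrow C$:

```python
# Minimal structural-model sketch of the stone/hiker example.
# Assumption: deterministic mechanisms with no noise terms.

def duck(stone: bool) -> bool:
    # The hiker ducks if and only if the stone rolls toward them.
    return stone

def survive(stone: bool, ducked: bool) -> bool:
    # The hiker dies only if the stone rolls and they fail to duck.
    return not (stone and not ducked)

def causally_depends(effect, cause_on, cause_off) -> bool:
    # Lewis-style dependence: the effect occurs with the cause present
    # and would not have occurred with the cause absent.
    return effect(cause_on) and not effect(cause_off)

# A -> B: ducking causally depends on the stone rolling.
print(causally_depends(duck, True, False))                           # True

# B -> C: survival causally depends on ducking (with the stone rolling).
print(causally_depends(lambda d: survive(True, d), True, False))     # True

# A -> C: survival does NOT causally depend on the stone, because removing
# the stone also removes the need to duck.
print(causally_depends(lambda s: survive(s, duck(s)), True, False))  # False
```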
A neural network has many internal nodes, and these nodes are computed sequentially to produce the final output. It is therefore significant that the transitive property need not hold. For example, the activation of a circular pattern may cause the activation of a dog-nose pattern, and the dog-nose pattern may in turn cause the prediction of a dog; yet the counterfactual dependence between the circular pattern and the dog label can be weak.
However, for LLMs the most important analysis concerns causality between semantic features, rather than between the progressively built features that compose high-level concepts during the forward pass.
Another note concerns the importance of transitivity. In MI, circuit analysis aims to fully reverse engineer a neural network. When we analyze the relationships between neurons across layers, causality can provide an overview of the computational graph.
Suppose you start watching YouTube and want to know whether this increases your knowledge of causality. By itself, generic YouTube watching has no direct relationship to that knowledge. However, through a mediator, a YouTube channel about causality, watching does increase the knowledge with high probability; as a result, an indirect causal effect from YouTube to knowledge arises.
In this example, you want to localize the effect of the causality channel (the mediator). You can do this by measuring the indirect effect:
indirect effect = total effect (your knowledge score when you watch) - ablated effect (your score when the causality channel is removed)
You can run this experiment on a neural network by ablating neurons: you measure the contribution of a neuron to the output within the context of a given input's forward pass.
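A minimal sketch of such an ablation experiment in PyTorch; the toy model, input, layer choice, neuron index, and score (the mean logit of class 0) are all illustrative assumptions, not details from the reviewed papers:

```python
import torch
import torch.nn as nn

# Toy stand-ins: a small MLP, a random input batch, and a scalar score.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
layer = model[1]                      # the hidden activation we will ablate
x = torch.randn(2, 8)
score = lambda out: out[:, 0].mean()  # e.g., logit of a target class

def ablated_score(neuron_idx):
    # Zero-ablate one neuron's activation via a forward hook
    # (returning a value from the hook replaces the layer's output).
    def zero_neuron(module, inputs, output):
        output = output.clone()
        output[..., neuron_idx] = 0.0
        return output

    handle = layer.register_forward_hook(zero_neuron)
    try:
        with torch.no_grad():
            out = model(x)
    finally:
        handle.remove()
    return score(out)

with torch.no_grad():
    total = score(model(x))           # total effect: unablated score

# Indirect effect attributed to neuron 3 = total effect - ablated effect.
effect = total - ablated_score(neuron_idx=3)
print(effect)
```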
Consider three variables and the rules relating them.
An additional set of sufficient conditions can be imposed to guarantee transitivity.
This formulation may shed light on a path towards learning transitive variables in neural networks; however, the restrictions are somewhat too strict to be implemented naively.
The second condition may inspire a bottleneck structure in a neural network, where representations are encapsulated at a specific layer.
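As a purely illustrative sketch (not an implementation of the condition itself), such a bottleneck could be realized as a narrow intermediate layer through which all information must pass; the class name and layer sizes below are arbitrary assumptions:

```python
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """Hypothetical network whose representation is encapsulated in a
    narrow bottleneck layer between the encoder and the decoder."""

    def __init__(self, in_dim=512, bottleneck_dim=16, out_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim),
        )
        self.decoder = nn.Sequential(
            nn.ReLU(),
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, x):
        z = self.encoder(x)   # all downstream computation depends only on z
        return self.decoder(z)

net = BottleneckMLP()  # e.g., 512-dim input -> 16-dim bottleneck -> 10 classes
```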
In MI, counterfactual analysis assumes that a given node contributes to the output, and within this assumption several studies verify the contribution of that node. However, when multiple nodes jointly determine the output, this single-node view breaks down, as the following example shows.
Consider the following example. When we throw a ball at a wine glass, the glass breaks. Now two people throw balls independently, and we observe a broken wine glass. The causal question is: did the first person's throw cause the glass to break?
We cannot verify this, because two variables are linked to the event “broken wine glass.”
| | First Person Hit | Second Person Hit | Broken Glass |
|---|---|---|---|
| Case 1 | O | X | O |
| Case 2 | X | O | O |
| Case 3 | X | X | X |
| Case 4 (not possible) | O | O | O |

(O = yes, X = no)
The last case is not possible because the first person's hit already breaks the glass, so the second ball can no longer hit an intact glass. Therefore, we cannot attribute the broken glass to a specific person's throw. This example reveals that verifying causality with multiple variables is hard and nontrivial.
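A minimal sketch of this preemption problem in the same style as before, assuming both people throw and the first ball always arrives first; the functions are hypothetical:

```python
def broken(first_throws: bool, second_throws: bool) -> bool:
    first_hits = first_throws                       # the first ball hits an intact glass
    second_hits = second_throws and not first_hits  # the second ball hits only if the glass is still intact
    return first_hits or second_hits

def lewis_dependence(effect, on_world, off_world) -> bool:
    # Lewis-style dependence applied to a world with two variables.
    return effect(*on_world) and not effect(*off_world)

# Actual world: both throw and the first ball breaks the glass. Yet the naive
# counterfactual test says the first throw is not a cause, because the second
# ball would have broken the glass anyway.
print(lewis_dependence(broken, (True, True), (False, True)))  # False
```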
This example shows that counterfactual-based causality analysis and the localized contribution of a single variable may not truly reveal the causality between nodes in neural networks. To address this, we may need genuinely independent nodes within a layer, so that interactions between nodes do not confound the interpretation of causality (or, at the very least, metrics that evaluate how nodes interact in their contributions to the output).
Multiple interacting nodes or variables are an inherent property of causal modeling; the real world has many components interacting with each other. One promising approach would be to disentangle the nodes in a neural network so that the independence assumption holds.