Review on causality for mechanistic interpretability in ICML 2024

Literature reviews on ICML 2024 mechanistic interpretability workshop oral and spotlight papers.

Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks

Researchers in mechanistic interpretability (MI) aim to establish the effect of internal nodes in neural networks, so causal analysis plays a crucial role in this field. This paper critiques misuses of causal interpretation in MI, drawing on traditional pitfalls of interpretation in causality.

The counterfactual theory of causality (Lewis, 1973) is a useful tool for analyzing causal relationships, but MI researchers often do not consider the pitfalls of interpreting counterfactual analyses. The two problems are

  1. Overdetermination : simultaneous and similar causes of the same effect are missed.
  2. Non-transitivity : $(A \rightarrow B)$ and $(B \rightarrow C)$ hold, but $(A \rightarrow C)$ is not guaranteed.

Some of the meanings of causality are

David Hume (1748):
“A cause, if the first object had not been, the second had never existed”

Lewis (1973) proposes that

An event $E$ causally depends on $C$ iff (1) if $C$ had occurred, then $E$ would have occurred, and (2) if $C$ had not occurred, then $E$ would not have occurred.
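
Using the standard box-arrow notation for the counterfactual conditional ("if ... had occurred, ... would have occurred"), this definition can be written compactly as:

\[ E \text{ causally depends on } C \;\iff\; (C \;\Box\!\!\to\; E) \,\land\, (\neg C \;\Box\!\!\to\; \neg E) \]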

Note that counterfactual dependence is sufficient but not necessary for causation. In other words, a causal dependency between two variables can exist even when the counterfactual condition fails (I take the counterfactual condition here to be part (2): if $C$ had not occurred, then $E$ would not have occurred). Therefore, other kinds of causal relationships exist besides counterfactual dependence. This suggests that counterfactual dependence is a particularly strong dependency between two variables; I think it may be the strongest causal relationship (not confident).

1 Non-transitive example

Consider three events: a stone rolls toward a hiker ($A$), the hiker ducks ($B$), and the hiker survives ($C$).

Here the sequence of events $A \rightarrow B \rightarrow C$ occurs, and clearly there are two causal relationships: $A \rightarrow B$ and $B \rightarrow C$.

However, there is no causality between the rolling stone and the survival, because survival is determined by the mediator (the hiker's action of ducking): had the stone not rolled, the hiker would have survived anyway.
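
A minimal sketch of this example as structural equations in Python (the event encodings are my own, not from the paper): the counterfactual test passes for $A \rightarrow B$ and $B \rightarrow C$ but fails for $A \rightarrow C$.

```python
# Toy structural model for the hiker example:
#   A: a stone rolls toward the hiker
#   B: the hiker ducks (here, the hiker ducks exactly when the stone rolls)
#   C: the hiker survives (survives if there is no stone, or if they ducked)

def duck(stone: bool) -> bool:
    return stone

def survive(stone: bool, ducked: bool) -> bool:
    return (not stone) or ducked

# A -> B: counterfactual dependence holds.
assert duck(stone=True) and not duck(stone=False)

# B -> C: holding the stone fixed at its actual value (rolling),
# survival counterfactually depends on ducking.
assert survive(stone=True, ducked=True) and not survive(stone=True, ducked=False)

# A -> C: fails. If the stone had not rolled, the hiker would have survived anyway,
# so condition (2) of counterfactual dependence is violated.
print(survive(True, duck(True)), survive(False, duck(False)))  # True True -> no dependence
```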

1.1 Transitive example

The transitivity property is key to actual causality. There are many cases where transitivity holds, such as a chain of balls in which one ball hits a second, which then hits a third.

Note that $A \rightarrow C$ holds when $A \rightarrow B$ and $B \rightarrow C$ hold.

💡 1.2 Insights on Neural Networks

A neural network has many internal nodes, and these nodes are computed sequentially to produce the final output. It is meaningful that the transitive property does not necessarily hold. For example, the activation of a circular pattern causes the activation of a dog-nose pattern, and the dog-nose pattern in turn causes the prediction of a dog; however, the causal dependency between the circular pattern and the dog label is weak.

However, the most important analysis in LLMs concerns the causality between semantic features, rather than the progressive features that build up high-level concepts during the forward pass of the network.

Another note concerns the importance of transitivity. In MI, circuit analysis is important for fully reverse-engineering neural networks. When we analyze the relationships between neurons across different layers, causality can provide an overview of the computational graph.

1.3 Indirect Effect

Suppose you start watching YouTube and want to find the causal effect on your knowledge of causality. Trivially, there is no direct relationship between the two. However, there is a mediator: a causality-themed YouTube channel that increases your knowledge with high probability. As a result, a causal effect from YouTube to knowledge arises through this mediator.

In this example, you want to localize the effect of the causality channel. You can do this by measuring the indirect effect:

indirect effect = total effect (your score) − ablated effect (your score when you do not watch the causality channel)

You can run the same experiment on a neural network by ablating neurons. In this way, you measure the contribution of each neuron to the output within the context of a given input's forward pass.
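
A minimal sketch of this ablation experiment in PyTorch (the toy model, the chosen neuron index, and the input are all illustrative assumptions, not from the paper): we run one forward pass normally and one with a hidden neuron zeroed out, and take the difference as that neuron's contribution on this input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for "the network"; weights are random and purely illustrative.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.randn(1, 4)          # a single input
neuron_idx = 3                 # hypothetical hidden neuron to localize

def run(ablate: bool) -> torch.Tensor:
    def hook(module, inputs, output):
        if ablate:
            output = output.clone()
            output[:, neuron_idx] = 0.0   # zero-ablate the chosen neuron
        return output
    handle = model[1].register_forward_hook(hook)   # hook on the ReLU output
    try:
        with torch.no_grad():
            return model(x)
    finally:
        handle.remove()

total_effect = run(ablate=False)       # normal forward pass
ablated_effect = run(ablate=True)      # forward pass without the neuron
indirect_effect = total_effect - ablated_effect
print(indirect_effect.item())          # the neuron's contribution for this input
```

In practice, mean-ablation or resampling from other inputs is often preferred over zero-ablation, but the bookkeeping is the same.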

1.4 When does transitivity hold?

Consider three variables $A$, $B$, and $C$ with the following rules:

  1. $A=a_1 \rightarrow B = b_1$
  2. $B=b_1 \rightarrow C = c_1$
  3. $c_1 \ne c_2$
  4. $A=a_2 \rightarrow B =b_2$
  5. $(A = a_2 \land B = b_2) \rightarrow C = c_2$
\[\begin{pmatrix} A & & B & & C \\ 1 & \rightarrow & 1 & \rightarrow & 1 \\ 2 & \rightarrow & 2 & \rightarrow & 2 \\ \end{pmatrix}\]
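
A small sketch of these rules as structural equations (the value labels are just strings; this is illustrative, not from the paper): because both chains route through $B$, setting $A$ determines $C$, so transitivity holds.

```python
# Structural equations implementing rules 1-5 above.
def f_B(a: str) -> str:
    return {"a1": "b1", "a2": "b2"}[a]        # rules 1 and 4

def f_C(a: str, b: str) -> str:
    if b == "b1":
        return "c1"                           # rule 2
    if a == "a2" and b == "b2":
        return "c2"                           # rule 5
    raise ValueError("unreachable under the rules above")

# Transitivity: intervening on A determines C through B.
for a, expected_c in [("a1", "c1"), ("a2", "c2")]:
    assert f_C(a, f_B(a)) == expected_c       # c1 != c2 (rule 3), so the effect is real
print("A -> C holds for both settings of A")
```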

An additional set of sufficient conditions that guarantee transitivity is:

  1. For every value $b$ that $B$ can take, there exists a value $a$ that $A$ can take such that $A=a \rightarrow B=b$
  2. $B$ is in every causal path from $A$ to $C$.

This formulation may shed light on a path toward learning transitive variables in neural networks. However, the restrictions are somewhat too strict to be implemented naively.

The second condition may inspire a bottleneck structure in a neural network, where representations are encapsulated in a specific layer.
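
As a rough illustration of that idea (an assumed architecture, not something proposed in the paper), a bottleneck layer forces every causal path from the input to the output through a single narrow set of nodes, mirroring the second condition:

```python
import torch.nn as nn

# Hypothetical bottleneck MLP: every path from the 64-d input to the 10-d output
# must pass through the 4-d bottleneck layer (condition 2: B lies on every causal path).
bottleneck_mlp = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 4),  nn.ReLU(),   # the bottleneck layer
    nn.Linear(4, 32),  nn.ReLU(),
    nn.Linear(32, 10),
)
```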

2 Overdetermination example

In MI, counterfactual analysis assumes that a given node contributes to the output. Under this assumption, several studies verify the contribution of a given node. However, it is clear that

  1. A node may have causal relationships with mediators, so the actual cause may not be captured.
  2. Joint relationships between nodes exist, and verification on a single node does not highlight the genuine mechanism of the components.

Consider the following example. When we throw a ball at a wine glass, it breaks. Now two people throw balls independently, and we observe a broken wine glass. A question for causality is as follows: whose throw caused the glass to break?

Well, we cannot verify this, and the two variables are both linked to the event “broken wine glass”.

| | First Person Hit | Second Person Hit | Broken Glass |
| --- | --- | --- | --- |
| Case 1 | O | X | O |
| Case 2 | X | O | O |
| Case 3 | X | X | X |
| Case 4 (not possible) | O | O | O |

The last case is not possible, as the first person's hit will already have broken the glass. Therefore, we cannot say that the glass is broken whenever a particular person throws a ball. This example reveals that verifying causality over multiple variables is hard and nontrivial.
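
A short sketch of why the single-variable counterfactual test breaks down here (the function and names are illustrative): when both people throw, removing either throw alone leaves the glass broken, so neither throw passes the counterfactual test even though both are causally relevant.

```python
def glass_broken(first_throws: bool, second_throws: bool) -> bool:
    # The glass breaks if at least one ball is thrown at it.
    return first_throws or second_throws

# Actual world: both people throw, and the glass is broken.
actual = glass_broken(True, True)

# Counterfactual ("ablation") test for each person's throw individually:
first_is_cf_cause = actual and not glass_broken(False, True)    # False
second_is_cf_cause = actual and not glass_broken(True, False)   # False

# Both tests fail: a naive single-node counterfactual analysis misses causes
# whose effect is overdetermined by another node.
print(first_is_cf_cause, second_is_cf_cause)
```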

2.1 💡 Insights on Neural Networks

This example shows that counterfactual-based causality analysis and the localized contribution of a single variable may not truly reveal the causality between nodes in neural networks. To address this, we may need genuinely independent nodes within a layer, so that interactions between nodes do not affect the interpretation of causality (or at least we may need metrics to evaluate how the relationships between nodes contribute to causality).
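
One simple metric in that direction (my own sketch under the zero-ablation setup above, not from the paper) compares the effect of ablating two neurons jointly against the sum of their individual effects; a nonzero gap signals that the two nodes do not contribute independently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))   # toy network
x = torch.randn(1, 4)

def output_with_ablated(neurons: list[int]) -> float:
    def hook(module, inputs, output):
        if neurons:
            output = output.clone()
            output[:, neurons] = 0.0     # zero-ablate the chosen hidden neurons
        return output
    handle = model[1].register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(x).item()
    finally:
        handle.remove()

base = output_with_ablated([])
effect_i = base - output_with_ablated([2])        # individual effect of neuron 2
effect_j = base - output_with_ablated([5])        # individual effect of neuron 5
effect_ij = base - output_with_ablated([2, 5])    # joint effect of ablating both

# If the two neurons contributed independently, the joint effect would be additive;
# the residual measures their interaction.
interaction = effect_ij - (effect_i + effect_j)
print(interaction)
```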

Multiple nodes or variables are inherent to causal modeling; that is, the real world has many components interacting with each other. One good approach would be to clarify the nodes in a neural network so that the independence assumption holds.