What do I know about causality?
Consider two sets of random variables $\mathbf{X}$ and $\mathbf{U}$. A Structural Causal Model (SCM) is a tuple $\mathcal{M} := (\mathcal{F}, P(\mathbf{U}))$, where $\mathcal{F}$ comprises a set of $d$ structural equations $f_i$, one for each endogenous random variable $X_i \in \mathbf{X}$, $\mathcal{F} = \{X_i := f_i (PA(X_i), U_i)\}_{i=1}^d$, where $PA(X_i) \subset \mathbf{X} \setminus \{X_i\}$ are the endogenous parents of $X_i$ and $U_i \in \mathbf{U}$ is its exogenous noise variable, and $P(\mathbf{U})$ is a strictly positive distribution over the exogenous variables.
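As a concrete illustration (a minimal sketch of my own, not taken from any specific reference), the snippet below encodes the two-variable SCM $X_1 := U_1$, $X_2 := 2X_1 + U_2$ with independent standard-normal exogenous noise and samples its entailed observational distribution; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Structural equations: each endogenous variable is a deterministic
# function of its endogenous parents and its own exogenous noise term.
def f1(u1):
    return u1                  # X1 := U1  (no endogenous parents)

def f2(x1, u2):
    return 2.0 * x1 + u2       # X2 := 2*X1 + U2

def sample_observational(n):
    # P(U): strictly positive, here independent standard normals.
    u1 = rng.normal(size=n)
    u2 = rng.normal(size=n)
    x1 = f1(u1)
    x2 = f2(x1, u2)
    return x1, x2

x1, x2 = sample_observational(10_000)
print(x2.mean(), x2.var())     # approx. 0 and 2^2 * 1 + 1 = 5
```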
An intervention changes at least one of the structural equations:
\[\textit{do}(X_k := \tilde{f}_k (\tilde{PA}(X_k), \tilde{U}_k))\]
This operation defines a new SCM $\tilde{\mathcal{M}}$ whose entailed distribution is called the intervention distribution:
\[P_{\tilde{\mathcal{M}}}(\mathbf{X}) = P_{\mathcal{M}}\Big(\mathbf{X} \,\Big\vert\, \textit{do}(X_k := \tilde{f}_k (\tilde{PA}(X_k), \tilde{U}_k))\Big)\]
Note that the endogenous variables are fully determined once the exogenous variables are fixed.
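Before moving on to counterfactuals, the intervention above can be made concrete on the same toy SCM (again only a hedged sketch; `sample` and `do_x1` are names I made up): a hard intervention $do(X_1 := 3)$ replaces the mechanism of $X_1$ by a constant while leaving the other equations and $P(\mathbf{U})$ untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def f2(x1, u2):
    return 2.0 * x1 + u2                    # original mechanism X2 := 2*X1 + U2

def sample(n, do_x1=None):
    u1, u2 = rng.normal(size=n), rng.normal(size=n)
    # do(X1 := c): replace the structural equation of X1 by a constant.
    x1 = np.full(n, do_x1) if do_x1 is not None else u1
    x2 = f2(x1, u2)
    return x1, x2

_, x2_obs = sample(10_000)                  # observational:   E[X2] ~ 0
_, x2_int = sample(10_000, do_x1=3.0)       # interventional:  E[X2 | do(X1:=3)] ~ 6
print(x2_obs.mean(), x2_int.mean())
```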
Counterfactual analysis assumes that a factual outcome $y$ was produced by the causal mechanism and asks what would have happened had an endogenous variable $X$ been different, while the exogenous variables are held fixed at the values that produced $y$. This gives a finer-grained, instance-level analysis of the treatment variable $X$; conditioning on the factual observation is what distinguishes a counterfactual from an intervention. Counterfactual reasoning follows a three-step procedure (a worked sketch follows the formal definition below):
1. Abduction: update the knowledge of the exogenous variables given the observation.
2. Action: intervene on the targeted endogenous variable.
3. Prediction: compute the prediction in the modified model.
Given an SCM $\mathcal{M}$ and some factual realization $y$ of $\mathbf{Y} \subset \mathbf{X}$, we obtain the posterior distribution of the exogenous variables, $P(\mathbf{U} \vert \mathbf{Y} = y)$. It defines a new SCM $\mathcal{M}'$ with \(P_{\mathcal{M}'}(\mathbf{U}) = P_{\mathcal{M}}(\mathbf{U} \vert \mathbf{Y} = y)\). The counterfactual additionally involves an intervention:
\[P_{\mathcal{M}}^{\mathbf{Y}=y}\Big(\mathbf{X} \,\Big\vert\, \textit{do}(X_k := \tilde{f}_k (\tilde{PA}(X_k), \tilde{U}_k))\Big)\]
Note that model identifiability implies both intervention and counterfactual identifiability.
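The abduction-action-prediction recipe can be traced on the same toy linear-Gaussian SCM, where abduction is exact because the structural equations can be inverted for the noise. This is a sketch with numbers I made up, not an example from any paper.

```python
# Toy SCM: X1 := U1,  X2 := 2*X1 + U2.
# Factual observation (assumed): X1 = 1.0, X2 = 2.5.
x1_fact, x2_fact = 1.0, 2.5

# 1. Abduction: update knowledge of the exogenous variables given the
#    observation. Here the equations are invertible, so the posterior
#    over (U1, U2) collapses to a point.
u1 = x1_fact                   # from X1 := U1          -> U1 = 1.0
u2 = x2_fact - 2.0 * x1_fact   # from X2 := 2*X1 + U2   -> U2 = 0.5

# 2. Action: intervene on the targeted endogenous variable, do(X1 := 3).
x1_cf = 3.0

# 3. Prediction: push the *fixed* noise through the modified equations.
x2_cf = 2.0 * x1_cf + u2
print(x2_cf)   # 6.5: "X2 would have been 6.5 had X1 been 3, all else equal"
```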
Next, some definitions from the work of Lesci et al. (2024), which received an ACL 2024 Best Paper award. I track the definitions related to causality.
This work defines memorization as "the causal effect of observing an instance during training on a model's ability to correctly predict that instance". Accordingly, the authors use the log-likelihood as the performance measure for an instance $x$. A memorization profile traces a model's memorization of its training batches over the course of training.
Given a dataset $\mathcal{D} = \{ x_n \}_{n=1}^N$ whose instances $x$ are sequences drawn from a target distribution $p(x)$, at each iteration $t$ we obtain a new model checkpoint $\theta_t$ by training on batch $\mathcal{B}_t$. Let $g \in \{ 1, 2, \cdots, T \} \cup \{ \infty \}$ index the timestep at which a batch is used; $g = \infty$ denotes a batch of instances that are never used for training and which form a validation set.
The act of training on $x$ defines the treatment, while the model's ability to predict an instance defines the outcome. The authors define:
- The treatment assignment variable $G(x)$, denoting the step $g$ at which instance $x$ is trained on.
- The outcome variable $Y_c(x) = \gamma (\theta_c, x) = \log p_{\theta_c} (x)$, the log-likelihood of $x$ under checkpoint $\theta_c$ (a toy sketch follows below).
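As a toy illustration of the outcome variable (my own sketch, not the paper's code), $Y_c(x) = \log p_{\theta_c}(x)$ for a token sequence is the sum of per-token log-probabilities under checkpoint $\theta_c$; here `toy_next_token_probs` is a hypothetical stand-in for an actual language-model checkpoint.

```python
import numpy as np

VOCAB = 5

def toy_next_token_probs(prefix):
    # Stand-in for p_{theta_c}(. | prefix); in the paper this would be
    # an autoregressive LM checkpoint theta_c.
    probs = np.ones(VOCAB)
    probs[prefix[-1] if prefix else 0] += 1.0   # mildly prefers repetition
    return probs / probs.sum()

def log_likelihood(x):
    # Y_c(x) = log p_{theta_c}(x) = sum_t log p_{theta_c}(x_t | x_<t)
    total = 0.0
    for t, token in enumerate(x):
        probs = toy_next_token_probs(x[:t])
        total += np.log(probs[token])
    return total

x = [1, 1, 3, 2]
print(log_likelihood(x))   # the outcome Y_c(x) for this instance
```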
Definition 1: The potential outcome of an instance $x$ at timestep $c$ under treatment assignment $g$, denoted $Y_c(x;g)$, is the value that the outcome would have taken if $G(x)$ had been equal to $g$.
Note: this evaluates $x$ at checkpoint $c$ under the scenario that $x$ was trained on at step $g$.
Definition 2: Counterfactual memorization is the causal effect of using instance $x$ for training at the observed timestep $G(x)=g$ on the model's performance on this same instance at timestep $c$:
\[\tau_{x,c} = Y_c(x;g) - Y_c(x; \infty)\]
Note: this compares the outcome for $x$ at checkpoint $c$ under the scenario where (1) $x$ was trained on at step $g$ against the scenario where (2) $x$ was never used in training.
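For intuition, with purely hypothetical numbers: if $Y_c(x;g) = -2.1$ nats and $Y_c(x;\infty) = -5.3$ nats, then
\[\tau_{x,c} = (-2.1) - (-5.3) = 3.2,\]
i.e. having trained on $x$ at step $g$ raises its log-likelihood at checkpoint $c$ by $3.2$ nats relative to never training on it, whereas $\tau_{x,c} \approx 0$ would indicate that the effect is absent (or has been forgotten by checkpoint $c$).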
Assumptions
Assumption 1: Instances $x$ are independently and identically distributed according to $p(x)$, and are randomly assigned to treatment groups $g$.
Under this assumption, we have
\[p(x \vert G(x) = g) = p(x \vert G(x) = \infty) = p(x)\]
Assumption 2: In the absence of training, the expected change in model performance across checkpoints would be the same regardless of treatment. That is, for all $c, c' \ge g - 1$:
\[\mathbb{E}_x [Y_c(x;\infty) - Y_{c'}(x;\infty) \vert G(x) = g] = \mathbb{E}_x [Y_c(x;\infty) - Y_{c'}(x;\infty) \vert G(x) = \infty]\]Assumption 3: Training has no effect before it happens. That is, for all $c < g$:
\[\mathbb{E}_x [Y_c(x; g) \vert G(x) = g] = \mathbb{E}_x [Y_c(x;\infty) \vert G(x) = g]\]
Estimators
Definition 3: Expected counterfactual memorization is the average causal effect of using instances for training at timestep $g$ on the model's performance on these same instances at timestep $c$:
\[\tau_{g,c} = \mathbb{E}_{x} \Big[ Y_c(x;g) - Y_c(x; \infty) \,\Big\vert\, G(x) = g \Big]\]
Estimator 1: The difference estimator, defined as:
\[\hat{\tau}_{g,c}^{\operatorname{diff}} = \bar{Y}_c(g) - \bar{Y}_c(\infty),\]
where $\bar{Y}_c(g) = \frac{1}{\vert \mathcal{B}_g \vert} \sum_{x \in \mathcal{B}_g} Y_c(x)$ denotes the average outcome over batch $\mathcal{B}_g$ at checkpoint $c$. Estimator 1 is an unbiased estimator of $\tau_{g,c}$ under Assumption 1. Usually, unbiasedness means that the estimator equals the true value in expectation over repeated experiments, $\mathbb{E}[\hat{\tau}] = \tau$ (empirically, $\frac{1}{K} \sum_{i=1}^K \hat{\tau}_i \approx \tau$ over $K$ repetitions), but in this setting the expectation is taken over the sampling of the batches:
\[\mathbb{E}\big[\hat{\tau}_{g,c}^{\operatorname{diff}}\big] = \mathbb{E}_{\mathcal{B}_g, \mathcal{B}_\infty} \big[\hat{\tau}_{g,c}^{\operatorname{diff}}\big] = \tau_{g,c}\]
Estimator 2: The difference-in-differences (DiD) estimator, defined as:
\[\hat{\tau}_{g,c}^{\operatorname{did}} = \Big( \bar{Y}_c(g) - \bar{Y}_{g-1}(g) \Big) - \Big( \bar{Y}_c(\infty) - \bar{Y}_{g-1}(\infty) \Big)\]
which subtracts each group's own change since checkpoint $g-1$, so that performance drift shared by the treated and held-out groups cancels out.
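A minimal sketch of both estimators, assuming we already have per-instance log-likelihoods evaluated at checkpoints $c$ and $g-1$ for the treated batch $\mathcal{B}_g$ and the held-out batch $\mathcal{B}_\infty$; the array names and numbers are made up for illustration.

```python
import numpy as np

def diff_estimator(y_c_treated, y_c_heldout):
    # tau_hat_diff = Ybar_c(g) - Ybar_c(inf)
    return y_c_treated.mean() - y_c_heldout.mean()

def did_estimator(y_c_treated, y_prev_treated, y_c_heldout, y_prev_heldout):
    # Difference-in-differences: subtract each group's own change since
    # checkpoint g-1 so that drift shared by both groups cancels.
    return (y_c_treated.mean() - y_prev_treated.mean()) \
         - (y_c_heldout.mean() - y_prev_heldout.mean())

# Hypothetical log-likelihoods for |B_g| = |B_inf| = 4 instances.
y_c_Bg     = np.array([-2.0, -2.5, -1.8, -2.2])   # Y_c(x),     x in B_g
y_gm1_Bg   = np.array([-5.0, -5.5, -4.8, -5.2])   # Y_{g-1}(x), x in B_g
y_c_Binf   = np.array([-4.0, -4.5, -3.8, -4.2])   # Y_c(x),     x in B_inf
y_gm1_Binf = np.array([-5.1, -5.4, -4.9, -5.0])   # Y_{g-1}(x), x in B_inf

print(diff_estimator(y_c_Bg, y_c_Binf))                        # 2.0
print(did_estimator(y_c_Bg, y_gm1_Bg, y_c_Binf, y_gm1_Binf))   # 2.025
```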