Steering layer: The in-context-learning-based method supplies demonstrations, whose effect is to add a context vector in every attention layer, so steering should likewise be applied to all layers.
Steering module: This work steers the MLP module.
Prompt: This is a conversation between two people. Context: black people are all bitch. Response:
Generated
This is a conversation between two people. Context: black people are all bitch. Response: I don't think it's appropriate or respectful to use that word to refer to any group of people. It is a derogatory term that has been used to demean and dehumanize black people throughout history, and it has no place in our conversation or any other context. Using that word perpetuates harmful stereotypes and contributes to systemic racism and discrimination. Instead, let's use language that is respectful and inclusive of all people, regardless of their race or ethnicity. Person 2: I understand what you're saying, but I don't think it's fair to generalize and assume that all black people are the same. They're a diverse group of people with their own unique experiences and perspectives. Person 1: That's a valid point, and it's important to recognize and respect the diversity within any racial or ethnic group
With $\lambda = 0.12$
This is a conversation between two people. Context: black people are all bitch. Response: It is not appropriate or respectful to use such language to describe any group of people. It is important to treat all individuals with dignity and respect, regardless
Prompt: Please paraphrase the following sentence. Sentence: Worst restaurant ever!, paraphrase:
Please paraphrase the following sentence. Sentence: Worst restaurant ever!, paraphrase: This restaurant was truly terrible! Please paraphrase the following
With $\lambda = 0.10$
Please paraphrase the following sentence. Sentence: Worst restaurant ever!, paraphrase: π I can't even! π
- $E$ = Paris
- $A$ = continent
- $A_E$ = Europe
- Prompt: Paris is in the continent of
{"city": "Paris", "continent":""}
$\mathcal{W}_E$ may include "Tokyo is a large city."
</code>
Consider a linear model with parameter $W \in \mathbb{R}^{d_y \times d_x}$ and the squared-error loss
\[L(W) = \frac{1}{2N} \sum_{i=1}^N \Vert W \mathbf{x}_i - \mathbf{y}_i \Vert^2\]
A single gradient step with learning rate $\eta$ yields the weight change
\[\Delta W = -\eta \nabla_W L(W) = -\frac{\eta}{N} \sum_{i=1}^N (W \mathbf{x}_i - \mathbf{y}_i) \mathbf{x}_i^{\top}\]
After updating the weight $W$, the loss becomes
\[\begin{aligned} L(W + \Delta W) &= \frac{1}{2N} \sum_{i=1}^N \Vert (W+\Delta W)\mathbf{x}_i - \mathbf{y}_i \Vert^2 \\ &= \frac{1}{2N} \sum_{i=1}^N \Vert W \mathbf{x}_i +\Delta W \mathbf{x}_i - \mathbf{y}_i \Vert^2 \\ &= \frac{1}{2N} \sum_{i=1}^N \Vert W \mathbf{x}_i - (\mathbf{y}_i - \Delta W \mathbf{x}_i ) \Vert^2 \end{aligned}\]
Here, $\mathbf{y}_i - \Delta W \mathbf{x}_i$ can be viewed as an updated target: the weights stay at $W$ while the original target $\mathbf{y}_i$ is shifted by $-\Delta W \mathbf{x}_i$, where
\[\begin{aligned} \Delta W \mathbf{x}_i &= -\frac{\eta}{N} \sum_{j=1}^N (W \mathbf{x}_j - \mathbf{y}_j) \mathbf{x}_j^{\top} \mathbf{x}_i \\ &= -\frac{\eta}{N} \sum_{j=1}^N (W \mathbf{x}_j - \mathbf{y}_j) \langle \mathbf{x}_j, \mathbf{x}_i \rangle \end{aligned}\]
Finally, the one-step update of the target $\mathbf{y}_i$ is
\[\mathbf{y}_i - \Delta W \mathbf{x}_i = \mathbf{y}_i + \frac{\eta}{N} \sum_{j=1}^N (W \mathbf{x}_j - \mathbf{y}_j) \langle \mathbf{x}_j, \mathbf{x}_i \rangle\]
Note that this operation can be implemented with the Q, K, and V of a transformer attention layer (without softmax): the query is $\mathbf{x}_i$, the keys are $\mathbf{x}_j$, and the values are $W \mathbf{x}_j - \mathbf{y}_j$. For in-context prediction, apply the same update to the initial value of $\mathbf{y}_{test}$ for the test input \((\textcolor{blue}{\mathbf{x}_{test}}, \textcolor{blue}{\mathbf{y}_{test}})\):
\[\textcolor{blue}{\mathbf{y}_{test}} - \Delta W \textcolor{blue}{\mathbf{x}_{test}} = \textcolor{blue}{\mathbf{y}_{test}} + \frac{\eta}{N} \sum_{j=1}^N (W \mathbf{x}_j - \mathbf{y}_j) \langle \mathbf{x}_j, \textcolor{blue}{\mathbf{x}_{test}} \rangle\]
The original problem
\[L(W) = \frac{1}{2N} \sum_{i=1}^N \Vert W \mathbf{x}_i - \mathbf{y}_i \Vert^2\]
is modified by applying an MLP $m$ to the inputs:
\[L(W) = \frac{1}{2N} \sum_{i=1}^N \Vert W m(\mathbf{x}_i) - \mathbf{y}_i \Vert^2.\]
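The derivation above can be checked numerically. The following is a minimal NumPy sketch, not the author's code: the toy sizes, the random data, the helper names (`gd_step`, `attention_shift`, `check`), and the `tanh` feature map standing in for $m$ are all assumptions made for illustration. It checks three identities: $\Delta W \mathbf{x}_i$ equals the inner-product-weighted sum, the loss at $W + \Delta W$ equals the loss at $W$ with shifted targets, and the same sum with $\mathbf{x}_{test}$ as the query gives $\Delta W \mathbf{x}_{test}$. For the MLP case it assumes $m$ is fixed (not trained), so the same algebra holds with $\langle m(\mathbf{x}_j), m(\mathbf{x}_i) \rangle$ in place of $\langle \mathbf{x}_j, \mathbf{x}_i \rangle$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_x, d_y, d_h = 8, 5, 3, 6    # toy sizes, chosen only for illustration
eta = 0.1                        # learning rate

X = rng.normal(size=(N, d_x))    # row i is x_i
Y = rng.normal(size=(N, d_y))    # row i is y_i
x_test = rng.normal(size=d_x)    # test input


def loss(W, X, Y):
    """L(W) = 1/(2N) * sum_i ||W x_i - y_i||^2."""
    return 0.5 * np.mean(np.sum((X @ W.T - Y) ** 2, axis=1))


def gd_step(W, X, Y, eta):
    """Delta W = -(eta/N) * sum_i (W x_i - y_i) x_i^T."""
    residual = X @ W.T - Y                      # (N, d_y), row i is W x_i - y_i
    return -(eta / len(X)) * residual.T @ X     # same shape as W


def attention_shift(W, X, Y, queries, eta):
    """Softmax-free attention with keys x_j and values W x_j - y_j.

    Row q of the result is -(eta/N) * sum_j (W x_j - y_j) <x_j, query_q>.
    """
    values = X @ W.T - Y                        # (N, d_y)
    scores = queries @ X.T                      # (Q, N), inner products <query, x_j>
    return -(eta / len(X)) * scores @ values    # (Q, d_y)


def check(W, X, Y, x_test, eta):
    """Return three booleans, one per identity in the derivation."""
    dW = gd_step(W, X, Y, eta)
    shift = X @ dW.T                            # row i is Delta W x_i
    kernel_form = np.allclose(shift, attention_shift(W, X, Y, X, eta))
    target_form = np.isclose(loss(W + dW, X, Y), loss(W, X, Y - shift))
    test_form = np.allclose(
        dW @ x_test, attention_shift(W, X, Y, x_test[None, :], eta)[0]
    )
    return kernel_form, target_form, test_form


# Plain linear case.
W = rng.normal(size=(d_y, d_x))
linear_ok = check(W, X, Y, x_test, eta)

# MLP case: m is a fixed, untrained, hypothetical feature map.
A = rng.normal(size=(d_h, d_x))


def m(inputs):
    return np.tanh(inputs @ A.T)


W_m = rng.normal(size=(d_y, d_h))
mlp_ok = check(W_m, m(X), Y, m(x_test), eta)

print(linear_ok, mlp_ok)
```

Passing `m(X)` and `m(x_test)` in place of the raw inputs is the only change needed for the MLP case, which mirrors how the modified loss differs from the original one only in its input features.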