Try to understand direct preference optimization (DPO)
This post summarizes the direct preference optimization (DPO) work by Rafailov et al.
Overview
Summary: Comparison between the previous method (RLHF) and DPO, which does not require a reward model. DPO increases the likelihood of the preferred response $y_{win}$ and decreases the likelihood of the dispreferred response $y_{lose}$, relative to a reference policy.
\[\mathcal{L}_\mathrm{DPO}(\pi_\theta ; \pi_\mathrm{ref}) =
- \mathbb{E}_{(x,y_w, y_l)\sim \mathcal{D}} \Big[
\log \sigma
\Big(
\beta \log \frac{\pi_\theta (y_w|x)}{\pi_\mathrm{ref}(y_w | x)}
-
\beta \log \frac{\pi_\theta (y_l|x)}{\pi_\mathrm{ref}(y_l | x)}
\Big)
\Big]\]
Preference Tuning Results
Summary: DPO generates more positive reviews on the controlled sentiment task and shows better summarization results than the PPO-based baselines (DPO is shown in yellow in the referenced figures).
GPT-4 Win Rate
Summary: On the Anthropic Helpful and Harmless (HH) dialogue task, DPO generates responses that are judged more helpful and harmless than the preferred responses in the test set, measured by GPT-4 win rate.
Summary
Previously, reinforcement learning (RL) was applied to fine-tune GPT-style language models (LMs) from human feedback (RLHF). In that framework, a reward model predicts the preference score of a generated sentence, and the LM is optimized to maximize this reward with RL algorithms such as PPO.
The previous framework requires a reward model trained on paired sentences $y_{win}$ and $y_{lose}$. The authors of this work conjectured that the reward model may not be required, since the policy can be optimized directly on preference differences.
The proposed method, termed direct preference optimization (DPO), generates better sentences and converges faster.
That is, previous RLHF methods train a reward model that returns a scalar preference score, whereas DPO does not require a reward model at all.
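For a concrete picture of what DPO removes, here is a minimal sketch of the pairwise reward-model loss that the previous framework has to fit before any RL can start. The function name and inputs are hypothetical; `chosen_rewards` and `rejected_rewards` stand for scalar outputs of a separate reward model on $y_{win}$ and $y_{lose}$.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_win) - r(x, y_lose)).

    Both inputs have shape (batch_size,) and come from a separate reward model;
    this extra model (and the PPO stage that uses it) is what DPO avoids.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()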
DPO Loss
\[\Large{
\mathcal{L}_\mathrm{DPO}(\pi_\theta ; \pi_\mathrm{ref}) =
- \mathbb{E}_{(x,y_w, y_l)\sim \mathcal{D}} \Big[
\log \sigma
\Big(
\beta \log \frac{\pi_\theta (y_w|x)}{\pi_\mathrm{ref}(y_w | x)}
-
\beta \log \frac{\pi_\theta (y_l|x)}{\pi_\mathrm{ref}(y_l | x)}
\Big)
\Big]
}\]
# https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py
from typing import Tuple

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.FloatTensor,
             policy_rejected_logps: torch.FloatTensor,
             reference_chosen_logps: torch.FloatTensor,
             reference_rejected_logps: torch.FloatTensor,
             beta: float,
             reference_free: bool = False) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
    """Compute the DPO loss for a batch of policy and reference model log probabilities.

    Args:
        policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
        policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
        reference_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)
        reference_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)
        beta: Temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5.
            We ignore the reference model as beta -> 0.
        reference_free: If True, we ignore the _provided_ reference model and implicitly use a reference
            model that assigns equal probability to all responses.

    Returns:
        A tuple of three tensors: (losses, chosen_rewards, rejected_rewards).
        The losses tensor contains the DPO loss for each example in the batch.
        The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.
    """
    # Log-ratios of chosen over rejected under the policy and under the reference model.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps

    if reference_free:
        ref_logratios = 0

    # This difference is exactly the argument of the sigmoid in the DPO loss.
    logits = pi_logratios - ref_logratios
    losses = -F.logsigmoid(beta * logits)

    # Implicit rewards beta * log(pi_theta / pi_ref), detached so they are used only for logging.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()

    return losses, chosen_rewards, rejected_rewards
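A toy usage sketch of `dpo_loss`; the per-sequence log-probabilities below are made-up numbers purely for illustration (in practice they are the summed token log-probabilities of each response under the policy and the frozen reference model).
# Per-sequence log-probabilities, shape (batch_size,); values are illustrative only.
policy_chosen_logps = torch.tensor([-12.0, -15.0])
policy_rejected_logps = torch.tensor([-14.0, -14.5])
reference_chosen_logps = torch.tensor([-13.0, -15.5])
reference_rejected_logps = torch.tensor([-13.5, -14.0])

losses, chosen_rewards, rejected_rewards = dpo_loss(
    policy_chosen_logps, policy_rejected_logps,
    reference_chosen_logps, reference_rejected_logps,
    beta=0.1)

loss = losses.mean()  # backpropagate this through the policy only
print(loss, chosen_rewards, rejected_rewards)  # rewards are detached, for logging only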
Appendix: The Math Behind DPO
1. preference distribution (Bradley-Terry model)
\[p^*(y_1 \succ y_2 | x ) =
\frac{
\exp (r^* (x,y_1))
}
{
\exp(r^* (x,y_1)) +\exp(r^* (x,y_2))
}\]
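Dividing the numerator and denominator by $\exp(r^*(x,y_1))$ shows that this is just the sigmoid of the reward difference, which is what allows the binary classification framing in step 2:
\[p^*(y_1 \succ y_2 | x) =
\frac{1}{1 + \exp\big(r^*(x,y_2) - r^*(x,y_1)\big)}
= \sigma\big(r^*(x,y_1) - r^*(x,y_2)\big)\]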
2. reward-model training framed as a binary classification problem under the BT model
$\sigma$ is the sigmoid function.
\[\mathcal{L}_R(r_\phi, \mathcal{D})
=
- \mathbb{E}_{(x,y_w, y_l) \sim \mathcal{D}}
[
\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))
]\]
3. RL fine-tuning for GPT
\[\max_{\pi_\theta} \mathbb{E}_{x\sim \mathcal{D}, y\sim \pi_\theta (y|x)}[r_\phi(x,y)]
- \beta \mathbb{D}_\mathrm{KL}[\pi_\theta(y|x) \Vert \pi_\mathrm{ref}(y|x)]\]
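A sketch of why the closed-form policy in Equation 4 below maximizes this objective (following the derivation in the paper's appendix), with partition function $Z(x) = \sum_y \pi_\mathrm{ref}(y|x)\exp\big(\frac{1}{\beta}r(x,y)\big)$:
\[\mathbb{E}_{y\sim \pi_\theta}[r(x,y)] - \beta \mathbb{D}_\mathrm{KL}[\pi_\theta(y|x) \Vert \pi_\mathrm{ref}(y|x)]
= -\beta\, \mathbb{D}_\mathrm{KL}\Big[\pi_\theta(y|x) \,\Big\Vert\, \tfrac{1}{Z(x)}\pi_\mathrm{ref}(y|x)\exp\big(\tfrac{1}{\beta}r(x,y)\big)\Big] + \beta \log Z(x)\]
Since $Z(x)$ does not depend on $\pi_\theta$, the objective is maximized exactly when the KL term is zero, i.e. when $\pi_\theta$ equals the policy below.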
4. optimal policy of the KL-constrained objective in Equation 3
\[\pi_r(y|x) = \frac{1}{Z(x)} \pi_\mathrm{ref}(y|x) \exp \Big( \frac{1}{\beta} r(x,y) \Big)\]
5. reward function obtained from Equation 4.
\[r(x,y) = \beta \log \frac{\pi_r(y|x)}{ \pi_\mathrm{ref}(y|x)} + \beta \log Z(x)\]
6. the preference probability expressed in terms of the policies. Substituting the reward from Equation 5 into the preference distribution in Equation 1, the partition function $Z(x)$ cancels because it is the same for both responses, leaving only the policy and reference log-ratios.
\[p^*(y_1 \succ y_2 | x) = \frac{1}
{
1+ \exp \Big( \beta \log \frac{\pi^*(y_2|x)}{\pi_\mathrm{ref}(y_2|x)}
- \beta \log \frac{\pi^* (y_1|x)}{\pi_\mathrm{ref}(y_1|x)} \Big)
}\]
7. DPO maximum likelihood objective
\[\mathcal{L}_\mathrm{DPO}(\pi_\theta ; \pi_\mathrm{ref}) =
- \mathbb{E}_{(x,y_w, y_l)\sim \mathcal{D}} \Big[
\log \sigma
\Big(
\beta \log \frac{\pi_\theta (y_w|x)}{\pi_\mathrm{ref}(y_w | x)}
-
\beta \log \frac{\pi_\theta (y_l|x)}{\pi_\mathrm{ref}(y_l | x)}
\Big)
\Big]\]
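Finally, it helps to look at the gradient of this objective. Writing the implicit reward as $\hat{r}_\theta(x,y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_\mathrm{ref}(y|x)}$, differentiating the loss above gives
\[\nabla_\theta \mathcal{L}_\mathrm{DPO}(\pi_\theta ; \pi_\mathrm{ref}) =
- \beta\, \mathbb{E}_{(x,y_w, y_l)\sim \mathcal{D}} \Big[
\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)
\big[
\nabla_\theta \log \pi_\theta (y_w|x) - \nabla_\theta \log \pi_\theta (y_l|x)
\big]
\Big]\]
Each pair pushes up the log-likelihood of $y_w$ and pushes down that of $y_l$, weighted by how badly the implicit reward currently mis-ranks the pair, so confidently wrong examples receive the largest updates.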