# All Roads Lead to Likelihood
> [**Paper Link:** All Roads Lead to Likelihood:
The Value of Reinforcement Learning in Fine-Tuning](https://arxiv.org/abs/2503.01067)
**Presenters:** Kevin Liu, Qihang Zhang
Oct 1st, 2025
---

## Background: RLHF Workflow in Current Industry
---

## Background: RM Is Widely Used in RLHF

**Empirical Observation:** This two-stage *online* RLHF consistently **outperforms** direct *offline* methods (e.g., DPO).

**Mystery:** Information theory suggests that an intermediate RM should lose information, not add it.

---

## Question

Why do we still need to train a Reward Model when doing RLHF, **even after DPO has been invented**?

---

## Hypotheses in the Community to Explain the Online RLHF Advantage

**H1.** On-policy Samples Provide Unique Signals Not Present in Offline Data

**H2.** Failure of Offline PFT Regularization to $\pi_{ref}$

**H3.** Relative Ease of Online PFT Optimization

**H4.** Global RMs Can Be Trained on More Data

**H5.** Global RMs Generalize Better OOD

---

## H1: On-policy Samples Provide Unique Signals Not Present in Offline Data

> **Hypothesis:** On-policy samples (generated by the current policy) provide unique, beneficial signals not present in offline data.

---

## H1 Validation

**Reasoning:** The RM is trained on *existing* human data. Scoring on-policy samples with this RM does not create *new* human preference information.

**Conclusion:** Information theory (the Data Processing Inequality) implies that no new information is gained.

Notes: The paper does not explicitly describe the general knowledge of the RM, but in all of its experiments the RM is obtained from the same SFT model with the final softmax layer removed and replaced by a linear layer. Its general knowledge therefore comes from the same SFT model, and its preference knowledge comes from the existing human preference dataset.

---

## H2: Failure of Offline PFT Regularization to $\pi_{ref}$

> **Hypothesis:** Offline methods (e.g., DPO) regularize poorly to the reference policy ($\pi_{ref}$), leading to suboptimal solutions; online methods inherently provide better regularization.

Notes: Offline PFT algorithms such as DPO might struggle with effective regularization, producing policies that diverge too far from the reference policy $\pi_{ref}$ and thus perform poorly. Online methods, by interacting with the environment, might inherently regularize better.

---

## General Experimental Setting

**Offline DPO:** Trained directly on **human preference data** (A > B).

**Online DPO:**

1. Trains a **Reward Model (RM)** on the human preference data.
2. Generates new preferences: the SFT model writes 25 responses per prompt from the preference dataset; the RM scores them, and the top- and bottom-scored responses form a new preference pair (see the pseudocode sketch on the next slide).
3. Fine-tunes with DPO on **this self-generated data**.

---

## H2 Exps: Identical Regularization

**Setup:** Online and offline DPO were tested with **identical regularization parameters**.

**Result:** The performance gap persisted.
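Below is a minimal pseudocode sketch of the online preference-generation step from the experimental setting above. The helper names `sft_generate`, `rm_score`, and `dpo_update` are placeholders assumed for illustration; this is not the paper's actual implementation.

```python
# Hypothetical sketch of the online-DPO preference generation described above
# (not the paper's code). Assumed placeholder callables:
#   sft_generate(prompt, n) -> list[str]     # n samples from the SFT policy
#   rm_score(prompt, response) -> float      # global reward model score
def build_online_preferences(prompts, sft_generate, rm_score, n_samples=25):
    """Keep the RM's top- and bottom-scored responses as a (chosen, rejected) pair."""
    pairs = []
    for prompt in prompts:
        responses = sft_generate(prompt, n_samples)                  # on-policy samples
        ranked = sorted(responses, key=lambda r: rm_score(prompt, r))
        pairs.append({"prompt": prompt,
                      "chosen": ranked[-1],                          # highest RM score
                      "rejected": ranked[0]})                        # lowest RM score
    return pairs

# Online DPO then runs a standard DPO update on these self-generated pairs,
# e.g. policy = dpo_update(policy, build_online_preferences(...)), where
# dpo_update is likewise a placeholder for an ordinary DPO training step.
```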
Notes: Counter-evidence: in carefully controlled experiments, both online and offline DPO were configured with identical regularization parameters. Despite the same regularization strategy, a significant performance gap persisted, indicating that regularization differences alone cannot explain the discrepancy. (On.DPO (SFT) is one-step online DPO; On.DPO (DPO) is two-step online DPO.)

---

## H3: Relative Ease of Online PFT Optimization

> **Hypothesis:** Offline PFT somehow faces a harder optimization problem than online PFT, forcing the former to escape extra local minima.

---

## H3 Exps: More Online Samples Yield No Gain

**Setup:** Tripled the amount of online data via prompt augmentation.

**Result:** Win rate showed **barely any improvement**.
Notes: The new dataset is derived from the SFT dataset, resampled, and scored with the RM.

---

## Global and Local Reward Models
#### Global Reward Model: The Usual Reward Model in RLHF

$$ r(\xi, s_0) := r(a _{0:H}, s_0) \\; \Big| \\; s_0 \sim \rho_0, \\; r \in \mathcal{R} $$
#### Local Reward Model: An Explicit Expression of Reward Derived from DPO

$$ \mathcal{R}(\Pi) = \Big\\{ r_\pi(\xi \mid s_0) := \sum_{h=0}^{H} \log \pi(a_h \mid a _{0:h-1}, s_0) \\; \Big| \\; \pi \in \Pi, \\; s_0 \sim \rho_0 \Big\\} $$
where $\Pi$ is the set of all possible policies and $\mathcal{R}(\Pi) \iff \Pi$ (each policy corresponds to exactly one local reward).
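In code, the local reward is just the sequence log-probability under the policy. A minimal PyTorch sketch (the shapes and the `local_reward` helper are illustrative assumptions, not the paper's implementation), assuming `logits` are the policy's per-step outputs already conditioned on the prompt $s_0$:

```python
import torch
import torch.nn.functional as F

def local_reward(logits: torch.Tensor, action_ids: torch.Tensor) -> torch.Tensor:
    """Local (DPO-implied) reward of a trajectory xi = (a_0, ..., a_H):
    the sum over steps of log pi(a_h | a_{0:h-1}, s_0)."""
    log_probs = F.log_softmax(logits, dim=-1)                            # [H+1, V]
    taken = log_probs.gather(-1, action_ids.unsqueeze(-1)).squeeze(-1)   # [H+1]
    return taken.sum()                                                   # r_pi(xi | s_0)

# Toy usage with random logits standing in for a real policy's outputs:
logits = torch.randn(5, 100)            # 5 steps, 100-token vocabulary
actions = torch.randint(0, 100, (5,))   # the sampled trajectory
print(local_reward(logits, actions))
```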
Notes:

### Pronunciation

- $\xi$ is pronounced "ksee" (IPA /ksi/)
- $\iff$ is read as "is equivalent to"

---

## H4: Global RMs Can Be Trained on More Data

> **Hypothesis:** Reward Models (especially global RMs) benefit from being trained on wider, more diverse datasets, unlike policies.

---

## H4 Exps: Online Still Wins with Narrow Data

**Setup:** All models (RMs & policies) were trained on a **narrow, SFT-generated dataset**.

**Result:** Online DPO still improved performance.
Notes: The data is narrow because all responses come from the same SFT model and are judged by another AI (gpt-4o) rather than reflecting diverse raw human preferences. However, On.DPO (DPO) (orange bars, two rightmost columns) still shows improved performance, even exceeding On.DPO (SFT). So even with a narrow data source, online iteration still leads to improvements. This contradicts H4's prediction that the RM advantage depends on data breadth: if it did, the advantage should disappear when the data is narrow.

---

## H5: Global RMs Generalize Better OOD

> **Hypothesis:** Global RMs possess superior out-of-distribution (OOD) generalization compared to policies.

---

## H5 Exps: RM OOD Generalization Correlates with In-Distribution

* **Finding:** Global RMs achieve higher in-distribution validation likelihood, which **perfectly correlates** with better OOD performance.

-v-

* **Reframing the Question:** This leads to: **Why is learning a generalizable RM easier than learning a generalizable policy?**
Notes: Global RM: accepts the entire sequence as input and outputs a score. Local RM: sums the log-probabilities of each token in the generated sequence. Experiment 7: likelihood is analogous to accuracy; higher is better. Experiment 8: use offline DPO and SFT, then use the RM to score the best of N and compare with GPT-4o.

---

## All Roads Lead to Likelihood
- **Prove:** DPO is equivalent to an RM under a certain assumption.
- Attribute the performance gap to **a hypothesis aligned with the assumption**.
- Propose **Hypothesis 6:** The Generation-Verification Gap & Proper Policy Learning
---

## The Equivalences Between DPO and RM
Propose a [**Local Reward Model**](#local-reward-model-an-explicit-expression-of-reward-derived-from-dpo) to explicitly express the reward derived from DPO
Propose the [**Assumption**](#the-assumption-used-to-prove-the-equivalence) used to prove the equivalence
Prove that [**DPO is equivalent to an RM**](#equivalence-proven-dpo-target) under a certain assumption
---

## Global and Local Reward Models
#### Global Reward Model: The Usual Reward Model in RLHF

$$ r(\xi, s_0) := r(a _{0:H}, s_0) \\; \Big| \\; s_0 \sim \rho_0, \\; r \in \mathcal{R} $$
#### Local Reward Model: An Explicit Expression of Reward Derived from DPO

$$ \mathcal{R}(\Pi) = \Big\\{ r_\pi(\xi \mid s_0) := \sum_{h=0}^{H} \log \pi(a_h \mid a _{0:h-1}, s_0) \\; \Big| \\; \pi \in \Pi, \\; s_0 \sim \rho_0 \Big\\} $$
where $\Pi$ is the set of all possible policies and $\mathcal{R}(\Pi) \iff \Pi$ (each policy corresponds to exactly one local reward).
Notes:

### Pronunciation

- $\xi$ is pronounced "ksee" (IPA /ksi/)
- $\iff$ is read as "is equivalent to"

---

## The Assumption Used to Prove the Equivalence
**Assumption:** $\mathcal{R}(\Pi) \iff \Pi$
i.e., the two classes cover the same set of reward functions, regardless of how those rewards are represented.
> What does this mean from an engineering perspective?
>
> *No matter what the policy or reward model is, the parameters of the same transformer can be optimized with the same difficulty.*
---

## Equivalence Proven: DPO Target

$$ \pi^\star = \operatorname*{argmin}_{\pi \in \Pi} \\; \mathbb{D} _{\mathrm{KL}}(\mathbb{P}_D || \mathbb{P} _{\pi}) + \beta \\; \mathbb{D} _{\mathrm{KL}}(\mathbb{P} _{\pi} || \mathbb{P} _{\pi _{ref}}) $$

#### BT Model Target

$$ \mathbb{P}^{\mathrm{BT}} _{r _{\theta}} \left(\xi_1 \succ \xi_2 \mid s_0\right) = \sigma \big(r _{\theta}(\xi_1 \mid s_0) - r _{\theta}(\xi_2 \mid s_0)\big), \\; s_0 \sim \rho_0 $$
$$ r_\theta(\xi \mid s_0) = \log \pi_\theta(\xi \mid s_0) = \sum_{h=0}^{H} \log \pi_\theta(a_h \mid s_0, a _{0:h-1}) $$
$$ \mathbb{P}^{\mathrm{BT}} _{D} \left(\xi_1 \succ \xi_2 \mid s_0\right) = \mathbf{1} (\xi_1 = \xi^{+}) $$

$$ \mathbb{P}^{\mathrm{BT}} _{r _{\theta}} \left(\xi_1 \succ \xi_2 \mid s_0\right) = \sigma \big(\log \pi_\theta(\xi_1 \mid s_0) - \log \pi_\theta(\xi_2 \mid s_0)\big) $$
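To connect the two targets, one intermediate step (a sketch assembled from the definitions on this slide, not quoted from the paper): fitting the BT model to the data preference $\mathbb{P}^{\mathrm{BT}} _{D}$ by maximum likelihood, with the local reward $r_\theta = \log \pi_\theta$ plugged in, yields the familiar DPO-style sequence-level loss (the $\pi_{ref}$ KL term of the DPO target is omitted here, matching the simplified formulation above):

$$ \operatorname*{min} _{\theta} \\; \mathbb{E} _{(s_0, \xi^{+}, \xi^{-}) \sim D} \Big[ -\log \mathbb{P}^{\mathrm{BT}} _{r _{\theta}} \left(\xi^{+} \succ \xi^{-} \mid s_0\right) \Big] = \operatorname*{min} _{\theta} \\; \mathbb{E} \Big[ -\log \sigma \big(\log \pi_\theta(\xi^{+} \mid s_0) - \log \pi_\theta(\xi^{-} \mid s_0)\big) \Big] $$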
---

## Equivalence Proven: Maximum Entropy in RLHF

How can we get the soft-optimal policy from a given **global reward model**?

$$ \text{Example:} \\; \pi _{r _{\theta}}^\star = \operatorname*{argmax} _{\pi\in\Pi} \\; \mathbb{E} _{\xi\sim\pi} \\! \left[r _{\theta}(\xi \mid s_0)\right] + \mathcal{H}(\pi) $$
$$ \mathbb{P} _{r _{\theta}}^\star(\xi \mid s_0) = \frac{\exp\\!\big(r _{\theta}(\xi \mid s_0)\big)} {\sum _{\xi' \in \Xi \mid s_0} \exp\\!\big(r _{\theta}(\xi' \mid s_0)\big)} = \frac{\exp\\!\big(r _{\theta}(\xi \mid s_0)\big)}{Z(r _{\theta}, s_0)} $$
$$ \mathbb{P} _{r _{\theta}}^{\mathrm{BT}}\\!\left(\xi_1 \succ \xi_2 \mid s_0\right) = \sigma\\!\Big(\log \mathbb{P} _{r _{\theta}}^\star(\xi_1 \mid s_0) - \log \mathbb{P} _{r _{\theta}}^\star(\xi_2 \mid s_0)\Big) $$
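An intermediate step (a sketch based on the softmax form above, not quoted from the paper) shows why this matches the BT Model Target: in the pairwise difference of log-probabilities, the partition function $Z(r _{\theta}, s_0)$ cancels, leaving exactly the reward difference.

$$ \log \mathbb{P} _{r _{\theta}}^\star(\xi_1 \mid s_0) - \log \mathbb{P} _{r _{\theta}}^\star(\xi_2 \mid s_0) = \big(r _{\theta}(\xi_1 \mid s_0) - \log Z(r _{\theta}, s_0)\big) - \big(r _{\theta}(\xi_2 \mid s_0) - \log Z(r _{\theta}, s_0)\big) = r _{\theta}(\xi_1 \mid s_0) - r _{\theta}(\xi_2 \mid s_0) $$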
Notes: $\pi _{r _{\theta}}^\star$ is the soft-optimal policy; $\xi'$ is read "ksee prime".

---

## Equivalence Proven: Maximum Entropy in RLHF

$$ \mathbb{P} _{r _{\theta}}^{\mathrm{BT}}\\!\left(\xi_1 \succ \xi_2 \mid s_0\right) = \sigma\\!\Big(\log \mathbb{P} _{r _{\theta}}^\star(\xi_1 \mid s_0) - \log \mathbb{P} _{r _{\theta}}^\star(\xi_2 \mid s_0)\Big) $$

$$ \operatorname*{max} _{\pi\in\Pi} \\; \mathbb{E} _{\xi\sim\pi} \\! \left[r _{\theta}(\xi \mid s_0)\right] + \mathcal{H}(\pi) \iff \operatorname*{min} _{\pi\in\Pi} \\; \mathbb{D} _{\mathrm{KL}}(\mathbb{P}_D || \mathbb{P} _{\pi}) - \mathcal{H}(\pi) $$
The Optimization Objectives are Equivalent under the Assumption: $\mathcal{R}(\Pi) \iff \Pi$
---

## H6: The Generation-Verification Gap & Proper Policy Learning

> **Hypothesis:** The core reason is the **"Generation-Verification Gap"**: **verification (by the RM) is fundamentally simpler than generation (by the policy)**. Online RLHF transforms a difficult "improper learning" problem into an easier "proper learning" problem.

---

## H6: The Generation-Verification Gap & Proper Policy Learning

**RM Stage:** Learn a **relatively simple verifier** ($\hat{r}_{\text{sim}}$). This is an "easy" problem.

**RL Stage:** Constrain the policy search to the **subset of policies optimal for $\hat{r}_{\text{sim}}$** ($\Pi(\hat{r}_{\text{sim}})$). This shrinks the vast search space, turning the task into a "proper learning" problem.

**Contrast:** Offline methods (like DPO) attempt "improper learning" by directly optimizing policies over the entire, complex space ($\Pi$).

---

## H6 Exps: Filling the Gap Eliminates the Advantage

> **Prediction:** If the gap is the cause, then **reducing or eliminating this gap** should remove online PFT's performance advantage.

---

## H6 Exps: Exp 1 - Simplified Generation

**Setup:** Task: **two-word summarization**.

* **Impact:** "Generation" difficulty is drastically reduced; the gap shrinks.

**Result:** Online DPO's advantage over offline DPO became **minimal**.
---

## H6 Exps: Exp 2 - Harder Verification

**Setup:** Task: use **ROUGE-L** as the reward signal.

* **Impact:** "Verification" complexity rises to match "generation"; the gap shrinks.

**Result:** Online DPO showed **no improvement** over offline DPO.
---

# Thanks for Listening!