Direct Preference Optimization (DPO…

AI Coffee Break with Letitia

Dec 27, 2024

A Simpler Way to Fine-Tune Language Models than with RLHF


2 Comments

The Intelligence Layer (TIL)

Jan 6

Hey! Thanks for the great post about DPO vs RLHF. Just a couple of things:

1. DPO was an outstanding paper runner-up at NeurIPS 2023.

2. I'm not sure whether this is just my browser or a Substack issue, but the MathJax isn't rendering in the web version of Substack.

\( \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\textcolor{green}{\pi_\theta(y_w \mid x)}}{\textcolor{blue}{\pi_{\text{ref}}(y_w \mid x)}} - \beta \log \frac{\textcolor{red}{\pi_\theta(y_l \mid x)}}{\textcolor{blue}{\pi_{\text{ref}}(y_l \mid x)}} \right) \right]. \)
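For readers who cannot see the rendered equation, the loss above can also be read as code. This is only a minimal sketch in plain Python, not the implementation from the post or the paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration. Each input is taken to be the summed sequence log-probability log π(y | x) of the preferred (`w`) or dispreferred (`l`) answer under the policy or the reference model.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    All four arguments are equal-length lists of sequence log-probabilities.
    beta=0.1 is an illustrative default, not a value taken from the post.
    """
    losses = []
    for pw, pl, rw, rl in zip(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
        # beta * [log(pi_theta(y_w|x) / pi_ref(y_w|x)) - log(pi_theta(y_l|x) / pi_ref(y_l|x))]
        margin = beta * ((pw - rw) - (pl - rl))
        # -log(sigmoid(margin)) equals log(1 + exp(-margin)); computed in a
        # numerically stable way so large |margin| does not overflow.
        z = -margin
        losses.append(max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return sum(losses) / len(losses)


# If the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss([-5.0], [-7.0], [-5.0], [-7.0])
# Raising the preferred answer's log-probability relative to the reference lowers the loss.
improved = dpo_loss([-4.0], [-8.0], [-5.0], [-7.0])
```

When the policy and the reference agree, the sketch returns log 2 ≈ 0.693, and the value drops as the policy shifts probability toward the preferred answer, which matches the sign structure of the formula.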


AI Coffee Break with Letitia

Jan 6

Hey, thanks a lot for noticing and for taking the time to let me know! 🤗

1. I've updated the NeurIPS year.

2. I've replaced the LaTeX with pictures, since Substack fails to render it correctly in the published version. Somehow, the equations render almost correctly in the editing version of the post. So yeah, I'll put pictures in here and then go on to fix all my other posts. Quite sad, actually, that LaTeX isn't working properly. :(
