Discussion about this post

Hojae Lee

Hey! Thanks for the great post about DPO vs RLHF. Just a couple of things:

1. DPO was a runner-up for the Outstanding Paper Award at NeurIPS 2023.

2. I'm not sure whether this is a browser issue or a Substack issue, but the MathJax isn't rendering in the web version of Substack. The loss should read:

\( \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\textcolor{green}{\pi_\theta(y_w \mid x)}}{\textcolor{blue}{\pi_{\text{ref}}(y_w \mid x)}} - \beta \log \frac{\textcolor{red}{\pi_\theta(y_l \mid x)}}{\textcolor{blue}{\pi_{\text{ref}}(y_l \mid x)}} \right) \right]. \)
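In case it helps anyone reading along, the loss above is straightforward to sanity-check numerically. Here's a minimal sketch in plain Python (the function name and signature are my own; inputs are per-sequence log-probabilities as floats):

```python
import math

def dpo_loss(logp_w_theta, logp_w_ref, logp_l_theta, logp_l_ref, beta=0.1):
    """Per-example DPO loss.

    -log sigma( beta * [(log pi_theta(y_w|x) - log pi_ref(y_w|x))
                        - (log pi_theta(y_l|x) - log pi_ref(y_l|x))] )
    """
    # Implicit reward margin between the chosen (y_w) and rejected (y_l) responses.
    margin = beta * ((logp_w_theta - logp_w_ref) - (logp_l_theta - logp_l_ref))
    # -log sigmoid(margin), written to match the equation directly.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization, when the policy matches the reference model, both log-ratios are zero and the loss is -log(1/2) = log 2; as the policy assigns relatively more probability to the chosen response, the loss decreases toward zero.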
