Discussion about this post

The Intelligence Layer (TIL)

Hey! Thanks for the great post about DPO vs RLHF. Just a couple of things:

1. DPO was a runner-up for the Outstanding Paper award at last year's NeurIPS (2023).

2. I'm not sure whether this is just my browser or a Substack issue, but the MathJax isn't rendering in the web version of Substack; the DPO loss shows up as raw LaTeX:

\( \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\textcolor{green}{\pi_\theta(y_w \mid x)}}{\textcolor{blue}{\pi_{\text{ref}}(y_w \mid x)}} - \beta \log \frac{\textcolor{red}{\pi_\theta(y_l \mid x)}}{\textcolor{blue}{\pi_{\text{ref}}(y_l \mid x)}} \right) \right]. \)
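(For anyone squinting at the raw LaTeX above, here's a minimal sketch of what that loss computes in PyTorch. The function and argument names are just illustrative, not from the post, and beta=0.1 is only a common default, not a value the post specifies.)

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss shown above.

    Each argument is a tensor of per-example log-probabilities
    log pi(y | x), already summed over the response tokens.
    `beta` scales the implicit KL penalty against the reference model.
    """
    # log [ pi_theta(y_w|x) / pi_ref(y_w|x) ] and the same for y_l
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    logits = beta * (chosen_logratios - rejected_logratios)
    # -log sigma(z) == softplus(-z), which is the numerically stabler form
    return F.softplus(-logits).mean()
```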
