Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution – Paper Explained
Learn about how diffusion models can finally generate good quality text.
We've combed through the complex mathematics and dense pages of the “Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution” research paper to bring you the essential insights and key takeaways.
The paper won the ICML 2024 best paper award. Congrats to the authors!👏
« Optional: Enjoy this content in video format 👇! »
Diffusion Models are Finally Conquering Text Generation: A Deep Dive
☕Grab your favorite coffee cup because today, we’ve got some exciting news to share—Diffusion models can finally produce text that doesn’t look like a coffee bean walked across the keyboard!
The Rise of Diffusion Models in AI
Diffusion models have been making waves, especially for their ability to generate stunning images, realistic audio, and even videos. But when it came to text generation, they lagged behind, producing results that were more "word salad" than coherent sentences. However, that’s all starting to change.
If you need a refresher on how diffusion models work, we've covered this topic extensively in previous explainers.
The Challenge of Text Generation
Until recently, diffusion models struggled with text generation, often producing jumbled and nonsensical outputs. Here's an example of text generated by an autoregressive diffusion model presented at ICLR 2022:
Clearly, this wasn't ready to compete with autoregressive language models like GPT. But now, discrete diffusion models are stepping up to the plate, producing text that’s far more coherent and structured, signalling that GPT-style language models may have found a worthy competitor.
On top of catching up, diffusion models have some key advantages over autoregressive LLMs, such as the ability to accept prompts anywhere: at the beginning, in the middle, at the end, or even split across the input. Additionally, they can, in principle, generate multiple tokens at once.
Why Text Generation is Hard for Diffusion Models
The success of diffusion models in generating images and audio lies in their ability to handle continuous data: pixels in an image, or sound represented as spectrograms in audio. These types of data can be smoothly transitioned into noise and then denoised during the generation process. However, text is symbolic and discrete, making this process much more challenging.
In traditional diffusion models, noise is added to data like images in a step-by-step process until the data becomes unrecognizable. The model then learns to reverse this process, denoising the data step by step. For images and audio, this works beautifully because the data is continuous, allowing for smooth transitions.
But with text, the situation is different. Text is made up of discrete tokens (words or characters), and adding noise to these tokens is not as straightforward. Simply replacing words with random ones or masking them introduces jumps in the data that are too abrupt, making the generation process much more complex.
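To see the contrast, here's a tiny numpy sketch (the values and the placeholder id 999 are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous data (e.g. pixel intensities): noise can be added in small,
# smooth increments, so consecutive noise levels stay close to each other.
pixels = np.array([0.2, 0.8, 0.5])
slightly_noisy = pixels + 0.05 * rng.normal(size=3)

# Discrete data (token ids): the analogous corruption is replacing or
# masking whole tokens, which is an abrupt, all-or-nothing jump.
tokens = np.array([17, 4, 251])
corrupted = tokens.copy()
corrupted[rng.random(3) < 0.3] = 999   # 999 stands in for a random/[MASK] id

print(slightly_noisy)  # still close to the original pixel values
print(corrupted)       # some tokens suddenly become something else entirely
```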
Enter Discrete Diffusion Models
The key idea of the discrete diffusion model paper is fairly simple, despite the paper being very dense and math-heavy:
Instead of directly noising the tokens, diffusion is performed on each token's probability vector. This vector represents the probability of the token being each word in the vocabulary.
For example, imagine a token in a sentence has a 100% probability of being the word "cat" and 0% for every other word in the vocabulary. This is typically represented as a one-hot-encoded vector, where the position corresponding to "cat" is marked with a 1, and all other positions in the vocabulary are marked with 0 (as illustrated in the figure below 👇).
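To make this concrete, here's a minimal sketch with a made-up toy vocabulary (not the paper's tokenizer):

```python
import numpy as np

# Toy vocabulary, purely for illustration.
vocab = ["the", "cat", "sat", "on", "mat", "[MASK]"]
N = len(vocab)

def one_hot(word: str) -> np.ndarray:
    """Probability vector that puts 100% of the mass on `word`."""
    p = np.zeros(N)
    p[vocab.index(word)] = 1.0
    return p

p_cat = one_hot("cat")
print(p_cat)  # [0. 1. 0. 0. 0. 0.]  -> 100% "cat", 0% for every other word
```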
This continuous representation allows the diffusion model to learn how to bring the probabilities back to the original word during denoising, enabling the generation of coherent text. When denoising, soft probabilities over the vocabulary are enough to reconstruct the original text.
How It Works: The Forward and Backward Diffusion Processes
The forward diffusion process begins by gradually adding noise to the probability vectors, making them more uncertain at each step. The challenge for the model is to learn how to reverse this process. The backward diffusion involves predicting a "concrete score" for each token, which helps the model restore the original probabilities.
This approach is similar to how the BERT language model works: BERT also operates on masked tokens and predicts probabilities over a vocabulary. So in a sense, yes, BERT is a discrete diffusion model, but there are some key differences that set it apart:
The amount of masking during training. BERT is trained with only about 15% of tokens masked at any given time, whereas discrete diffusion models are trained with a much wider range of masking—from 0% to 100%. This means that BERT struggles if given a sequence with a high number of masked tokens, as it’s not used to dealing with such scenarios during training.
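To make this difference concrete, here's a toy sketch (the token ids are fake, and the paper's actual noise schedule is more involved than a uniformly drawn masking rate):

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0
seq = rng.integers(1, 1000, size=128)  # a fake token sequence

# BERT-style corruption: a fixed ~15% of positions are masked.
bert_masked = np.where(rng.random(seq.size) < 0.15, MASK_ID, seq)

# Diffusion-style corruption: the masking rate itself is drawn anew for
# every training example and can be anywhere between 0% and 100%.
rate = rng.random()
diff_masked = np.where(rng.random(seq.size) < rate, MASK_ID, seq)

print((bert_masked == MASK_ID).mean(), (diff_masked == MASK_ID).mean())
```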
Perhaps the most crucial distinction, however, lies in their loss functions. BERT employs a standard cross-entropy loss, which teaches the model to predict the correct word for each masked token based on the surrounding context. Discrete diffusion models, on the other hand, use a more complex loss function, adopting a more mathematically rigorous perspective on the entire process. We talk about it in the next section.
Score Entropy Discrete Diffusion Models
In the realm of discrete diffusion models, the forward diffusion process is crucial for generating the training data needed to produce coherent text. This process involves creating noisier and noisier samples that the diffusion model will later learn to denoise. Here’s how it works:
At each step of the forward diffusion process, a token is randomly selected, and its probability vector p_t is multiplied by a matrix Q to produce the noisier version of the probability vector, p_{t+1}, as such:
p_{t+1} = Q · p_t
Depending on the structure of matrix Q, this can mean the token's probabilities are shifted to represent a random word or masked entirely. For instance, a matrix Q_uniform will make the probabilities point to a random word (N being the size of the vocabulary), while another structure, Q_absorb, will set all probabilities to zero except for the MASK token.
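Here's a minimal numpy sketch of what such matrices could look like. This is the simplified, single-step "transition matrix" view used in this explainer, with a made-up vocabulary size; in the paper, Q is actually a rate matrix in a continuous-time formulation, but the intuition is the same:

```python
import numpy as np

N = 5          # toy vocabulary size; the last index plays the role of [MASK]
MASK = N - 1

# Q_uniform: every column spreads its probability mass evenly over the
# vocabulary, so any token is pushed towards a uniformly random word.
Q_uniform = np.full((N, N), 1.0 / N)

# Q_absorb: every column dumps all of its mass onto the MASK token,
# so every token eventually gets absorbed into [MASK].
Q_absorb = np.zeros((N, N))
Q_absorb[MASK, :] = 1.0

p_t = np.zeros(N)
p_t[1] = 1.0                  # one-hot probability vector for some word
print(Q_uniform @ p_t)        # -> uniform distribution over the vocabulary
print(Q_absorb @ p_t)         # -> all probability on [MASK]
```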
This process is mathematically represented by a diffusion equation in the paper, indicating that the change (or the "flipping" of the token) is caused by a linear transformation applied to the token’s probabilities during forward diffusion.
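Written out (loosely following the paper's notation), the forward process is a linear ordinary differential equation on the probability vector, roughly of the form

$$\frac{\mathrm{d}p_t}{\mathrm{d}t} = Q_t \, p_t,$$

where Q_t is the noise matrix Q from above, possibly rescaled over time to control how quickly the probabilities are corrupted.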
Now, the backward diffusion process comes into play. The model’s job here is to reverse the noise added during forward diffusion. To do this, it needs to find the matrix \bar{Q}, which, when multiplied by p_{t+1}, reproduces the original probability vector p_t.
Although this is not straightforward, it turns out that the inversion matrix \bar{Q} is simply a probability ratio multiplied by the original matrix Q! This ratio is called the "concrete score", and it is what the model s_theta learns to predict. By multiplying this score with the original diffusion matrix Q, the model can reverse the diffusion process. This comes in super handy because nowhere in this computation do we need to compute intractable partition functions, yay!
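In loose notation, writing Q_t(y → x) for the forward rate of flipping token y into token x, the time-reversed process uses rates roughly of the form

$$\bar{Q}_t(x \to y) \;=\; \frac{p_t(y)}{p_t(x)}\, Q_t(y \to x), \qquad s_\theta(x)_y \;\approx\; \frac{p_t(y)}{p_t(x)},$$

so the only unknown quantity the network has to supply is a ratio of data probabilities, and any normalizing constant cancels out of that ratio.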
The authors of the paper train a transformer model to predict this concrete score. Since the training data includes the probabilities generated during the forward diffusion process, they can easily calculate this concrete score and teach the model to output it using a cross-entropy-like loss function:
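As a rough sketch of the flavor of this loss (my simplified reading of the score entropy term; the actual objective in the paper sums such terms over token pairs with weights and uses a denoising form that only needs tractable conditional ratios):

```python
import numpy as np

def score_entropy_term(s_pred, ratio):
    """Per-pair score entropy term: s - ratio*log(s) + ratio*(log(ratio) - 1).

    Convex in s_pred and minimized (at exactly 0) when the predicted ratio
    s_pred equals the true ratio p(y)/p(x), which is what makes it behave
    like a cross-entropy for probability ratios."""
    return s_pred - ratio * np.log(s_pred) + ratio * (np.log(ratio) - 1.0)

true_ratio = 0.4
for guess in [0.1, 0.4, 2.0]:
    print(guess, score_entropy_term(guess, true_ratio))
# The loss is 0 for the correct guess (0.4) and positive otherwise.
```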
Once the model has learned to predict the concrete score, it can be used to generate text during inference. Starting with a sequence where all probabilities are uniformly noisy or masked (except for the prompt tokens), the discrete diffusion model denoises the sequence token by token. For each token, the model predicts the concrete score, which is then multiplied by the forward diffusion matrix Q and the noisy probabilities p_{t+1} to produce the probability of the predicted token. Repeating this process token by token eventually generates the final coherent text sequence.
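Here's a heavily simplified sketch of what such a denoising loop can look like for the masking (absorbing) variant. This is not the authors' sampler: the "model" below just returns random probabilities so the snippet runs on its own, and the unmasking schedule is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, STEPS = 100, 16, 8     # toy vocabulary size, sequence length, steps
MASK = N - 1

def dummy_model(tokens):
    """Stand-in for the trained transformer: returns a distribution over
    the vocabulary for every position (derived from the predicted concrete
    scores in the real model)."""
    logits = rng.normal(size=(len(tokens), N - 1))      # never predict MASK
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

tokens = np.full(L, MASK)                # start from a fully masked sequence
for step in range(STEPS):
    still_masked = np.where(tokens == MASK)[0]
    if len(still_masked) == 0:
        break
    probs = dummy_model(tokens)
    # Reveal a fraction of the remaining masked positions at every step.
    n_reveal = max(1, len(still_masked) // (STEPS - step))
    reveal = rng.choice(still_masked, size=n_reveal, replace=False)
    for i in reveal:
        tokens[i] = rng.choice(N - 1, p=probs[i])       # sample a word
print(tokens)                            # a fully denoised toy sequence
```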
Results and Implications
The results are promising. The discrete diffusion models introduced in the paper, called SEDD, showed perplexity scores (a measure of how well the model predicts text; lower is better) comparable to GPT-2, a significant achievement given that diffusion models were previously far behind in text generation quality.
While these models are not yet at the level of ChatGPT, they are a strong competitor to GPT-2, especially considering their relatively smaller size—320 million parameters compared to GPT-2's 340 million. This success suggests that with further development, diffusion models could potentially surpass current autoregressive models in both quality and efficiency.
Thoughts
The advent of discrete diffusion models marks a significant step forward in diffusion models' ability to generate text. As mentioned above, they can accept prompts anywhere in the input and can, in principle, generate multiple tokens at once. While they are not yet ready to dethrone models like ChatGPT, they offer a glimpse into a future where diffusion models could provide a more flexible and powerful alternative to the autoregressive approaches currently dominating the field.
But I think that even if this approach could scale in principle, there's a significant challenge: we've already invested heavily in hardware and software optimizations for GPTs / autoregressive transformers. Given the sunk cost fallacy, it's hard to imagine tech giants abandoning their current LLMs to start training diffusion LLMs, especially since it could take years for them to catch up to ChatGPT and similar models. Much like Mamba, I fear discrete diffusion might also lose the hardware/software lottery.
If you're interested in diving deeper into the technical details, I highly recommend checking out the original paper and a talk by the first author, Aaron Lou.
Thank you for joining me on this dive into diffusion models! To stay updated on the latest in AI, be sure to subscribe to the blog and follow along on YouTube. And of course, enjoy your coffee break with the next exciting installment of AI Coffee Break!