Extended definition
Diffusion models are a family of deep generative models that learn to synthesize data by inverting a noising process. The mechanism has two stages. In the forward stage, Gaussian noise is added to the data over many small steps until the original sample is indistinguishable from pure noise. In the reverse stage, a neural network learns to undo this process step by step, starting from noise and recovering a sample consistent with the data distribution. Croitoru and colleagues (2023) organize the field into three closely related formulations: denoising diffusion probabilistic models (DDPM), noise-conditioned score-based networks, and the stochastic differential equation formulation. Yang and colleagues (2023), in their survey of the field, show that these three views describe the same principle, and they organize research around three goals: efficient sampling, improved likelihood estimation, and handling data with special structure. Latent diffusion, which operates in a compressed latent space rather than on raw pixels, is what made image generation practical at scale.
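The forward stage described above can be sketched in a few lines. This is a minimal illustration under the DDPM formulation, not a reference implementation: the linear noise schedule, its endpoint values, the step count of 1000, and the function names are all assumptions chosen for the example, and the random array stands in for real data.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances. The linear range is an assumed, illustrative choice."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns x_t and the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (simplified training target)."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance left after t steps

x0 = rng.standard_normal((4, 8))     # stand-in "data": 4 samples, 8 features
x_early, _ = forward_noise(x0, 10, alpha_bar, rng)     # still close to the data
x_late, eps = forward_noise(x0, 999, alpha_bar, rng)   # almost pure noise
```

In training, a network would receive `x_t` and `t` and be asked to predict `eps`; the reverse stage then uses that prediction to step from noise back toward data.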
When it applies
Diffusion models apply when the task is to generate high-fidelity data with good mode coverage: image, video, and audio synthesis, as well as molecule design. Cao and colleagues (2024) document why diffusion became the dominant paradigm: it achieves higher quality and diversity than generative adversarial networks (GANs) and trains more stably, without the mode collapse that affects GANs. It applies well to conditional generation, where the output is guided by text, a semantic mask, or a reference image; this is the basis of text-to-image systems. It also applies to low-level inverse problems such as super-resolution, denoising, and inpainting, where the diffusion model serves as a strong generative prior. In research, it is widely used to produce synthetic images for data augmentation.
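One common way to guide generation with a condition such as text is a technique usually called classifier-free guidance, which the text above does not name; it is shown here only as an illustration of how a condition can steer each denoising step. The sketch assumes the network has already produced two noise predictions for the same noisy input, one with the condition and one without. The function name and the example arrays are hypothetical.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    guidance_scale = 0 ignores the condition, 1 uses the conditional
    prediction as-is, and values above 1 push further toward the condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Stand-in predictions for one denoising step (hypothetical values).
eps_uncond = np.array([0.0, 1.0, -1.0])
eps_cond = np.array([1.0, 1.0, 0.0])

no_guidance = guided_noise(eps_uncond, eps_cond, 0.0)
plain_cond = guided_noise(eps_uncond, eps_cond, 1.0)
strong = guided_noise(eps_uncond, eps_cond, 3.0)
```

Larger scales generally trade diversity for closer adherence to the condition, so the scale is a tuning knob rather than a value to maximize.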
When it does not apply
Diffusion models do not apply well when latency matters. Iterative sampling requires dozens to hundreds of network evaluations, which makes them slow and costly compared with a GAN, which generates a sample in a single forward pass; Yang and colleagues (2023) treat efficient sampling as the main open problem for precisely this reason. They also carry a significant computational cost: training and inference consume memory and energy at a level that rules out edge deployment without compression. They are not a solution for scarce data: quality depends on large training volumes, and in small-data regimes other methods are more competitive. And they do not apply where interpretability of the generative process is required, since the denoising trajectory offers no direct explanation of the produced sample.
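The latency argument is simple arithmetic, sketched below. The per-evaluation time of 40 ms is a hypothetical figure, not a measurement, and the helper names are invented for this example. The second function shows one assumed way to pick a shorter, evenly spaced subsequence of timesteps, which is the kind of schedule some accelerated samplers use.

```python
import numpy as np

def sampling_latency_ms(num_steps, ms_per_eval, evals_per_step=1):
    """Estimated time to generate one sample: one or more network calls per step."""
    return num_steps * evals_per_step * ms_per_eval

def strided_timesteps(train_steps, num_steps):
    """Evenly spaced timesteps from train_steps - 1 down to 0 (assumed schedule)."""
    return np.linspace(train_steps - 1, 0, num_steps).round().astype(int)

MS_PER_EVAL = 40.0  # hypothetical cost of one network evaluation

single_pass = sampling_latency_ms(1, MS_PER_EVAL)    # one forward pass, GAN-like
short_run = sampling_latency_ms(50, MS_PER_EVAL)     # "dozens" of steps
long_run = sampling_latency_ms(250, MS_PER_EVAL)     # "hundreds" of steps

schedule = strided_timesteps(1000, 50)  # 50 of 1000 training timesteps
```

Under these assumed numbers the gap between a single pass and a 250-step run is more than two orders of magnitude, which is why the step count should be budgeted before a prototype is promoted to production.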
Applications by field
- Computer vision: high-resolution image generation and editing, super-resolution, inpainting, and image-to-image translation with a diffusion prior.
- Medical imaging: reconstruction, denoising, and synthetic image generation for data augmentation, provided that clinical fidelity is validated.
- Life sciences and chemistry: molecule and structure design, where diffusion samples candidates from a high-dimensional space.
- Audio and video: speech and temporal-sequence synthesis, areas where diffusion’s mode coverage surpasses that of earlier models.
Common pitfalls
The first pitfall is ignoring sampling cost: prototyping with a diffusion model without budgeting the number of sampling steps leads to inference that is infeasible in production. The second is conflating latent diffusion with pixel diffusion: operating in latent space radically changes cost and quality, and treating the two as equivalent misleads planning. The third is assuming that more steps always improve the output; there is a point of diminishing returns, and the choice of sampler matters as much as the step count. The fourth is using synthetic diffusion images as if they were real data without checking for bias and for leakage of memorized training examples. The fifth is evaluating generation by a single metric such as FID, which captures aggregate fidelity but does not detect semantic failures, local artifacts, or lack of adherence to the provided condition.
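The limitation of FID in the fifth pitfall follows from how the metric is defined: it compares only the mean and covariance of feature vectors from the two image sets. The sketch below computes that Fréchet distance on feature arrays that are assumed to be already extracted; in practice the features come from a pretrained image network, which is omitted here, and the random arrays are stand-ins.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) from the eigenvalues of the product, which are
    # real and non-negative for covariance matrices (up to rounding error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 4))        # stand-in features, not real image data

shifted = feats + 1.0                        # every feature moved by 1
reflected = 2.0 * feats.mean(axis=0) - feats # every sample changed, same mean and covariance

d_same = frechet_distance(feats, feats)
d_shift = frechet_distance(feats, shifted)
d_reflect = frechet_distance(feats, reflected)
```

The reflected set differs from the original in every individual sample, yet its distance is essentially zero because the aggregate statistics match. That is the blind spot the pitfall describes, and it is the reason to pair FID with per-sample checks such as condition adherence and artifact inspection.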