Diffusion Model Inference Scaling: A New Paradigm
Introduction to Inference Scaling in Diffusion Models
Recent advancements in Large Language Models (LLMs) have shown that scaling computation at inference time can significantly boost performance; models like o1, o3, DeepSeek R1, QwQ, and Step Reasoner mini exemplify this trend. This raises a compelling question: can the same principle be applied to diffusion models? A team at New York University, led by Saining Xie, has investigated exactly this question, finding that inference-time scaling is indeed effective for diffusion models.
Key Findings of the Study
The research highlights several key findings:
- Inference-time scaling is effective: Allocating more computational resources during inference leads to higher quality samples.
- Flexibility in component combinations: The framework allows for different component configurations, adapting to diverse applications.
- Beyond denoising steps: Searching for better sampling noise offers another dimension along which to scale inference compute (measured in NFE, the number of function evaluations), beyond simply increasing the number of denoising steps.
- Two design axes: The framework focuses on two key aspects:
  - Verifiers: Providing feedback during the search process.
  - Algorithms: Finding better noise candidates.
Research Methodology and Scenarios
The team explored three scenarios for verifiers, simulating various use cases:
- Scenarios with privileged information about the final evaluation.
- Scenarios with conditional information to guide the generation.
- Scenarios with no additional information available.
For algorithms, they investigated:
- Random Search: Selecting the best from a fixed set of candidates.
- Zero-Order Search: Iteratively improving noise candidates using verifier feedback.
- Path Search: Iteratively improving diffusion sampling trajectories using verifier feedback.
Initial experiments were conducted using a simple ImageNet class-conditional generation setup, before applying the designs to larger-scale text-conditional generation.
Scaling Inference Time Framework
The paper proposes a framework for scaling inference time in diffusion models, framing the challenge as a search for optimal sampling noise. The process involves two core components:
- Verifiers: Pre-trained models assessing the quality of generated samples, outputting a scalar score.
- Algorithms: Search procedures that use verifier scores to find better candidate samples.
The total inference budget is measured by the total number of function evaluations (NFE), including both denoising steps and search costs.
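To make the interplay concrete, below is a minimal sketch of the simplest instance of this framework: random search over starting noises. The `denoise` and `verifier` callables are hypothetical stand-ins, not names from the paper's released code; `denoise` runs the ODE solver for a given number of steps, and `verifier` returns a scalar score.

```python
import torch

def random_search(denoise, verifier, num_candidates, steps, shape):
    """Draw several starting noises, denoise each, keep the verifier's favorite."""
    best_sample, best_score = None, float("-inf")
    total_nfe = 0
    for _ in range(num_candidates):
        noise = torch.randn(shape)            # candidate starting noise
        sample = denoise(noise, steps=steps)  # full sampling trajectory
        total_nfe += steps                    # each denoising step is one NFE
        score = verifier(sample)              # scalar quality score
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, total_nfe
```

Note how the budget grows linearly with the number of candidates: search cost and denoising cost draw from the same pool of NFEs.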
Search Verifiers and the Challenge of "Verifier Hacking"
The researchers started with an oracle verifier, which has complete information about the final evaluation of the selected samples; for ImageNet, metrics such as FID and IS served this role. They then explored more accessible pre-trained models as supervised verifiers, such as CLIP and DINO, which classify each sample and select the one with the highest logit for the class label.
However, these classifiers operate point-wise and only partially align with the objective of the FID score. As computation increased, this misalignment reduced sample variance and eventually caused mode collapse. The phenomenon, termed "verifier hacking," was accelerated by the unconstrained search space of the random search algorithm.
The study also found that verifiers do not always need conditional information to guide the search. A strong correlation was observed between the logits of DINO/CLIP classifiers and the feature-space cosine similarity between intermediate and final samples, which motivated self-supervised verifiers that exhibited equally effective scaling behavior.
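A hedged sketch of such a self-supervised verifier follows. The `feature_extractor` (e.g., a DINO or CLIP image backbone loaded elsewhere) is assumed, and comparing a single intermediate sample against the final one is an illustrative simplification.

```python
import torch
import torch.nn.functional as F

def self_supervised_score(feature_extractor, x_intermediate, x_final):
    """Score a batch of samples by the cosine similarity between the features
    of an intermediate (slightly noisy) sample and the final clean sample.
    No class label or text prompt is required."""
    with torch.no_grad():
        f_mid = feature_extractor(x_intermediate)
        f_end = feature_extractor(x_final)
    # Flatten any spatial dimensions, then compare per batch element.
    return F.cosine_similarity(f_mid.flatten(1), f_end.flatten(1), dim=1)
```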
Search Algorithms: Mitigating Verifier Hacking
To mitigate verifier hacking, the researchers explored more refined search algorithms:
Zero-order search (a sketch follows the list):
- Start with random Gaussian noise as a pivot point.
- Find N candidates in the pivot point's neighborhood.
- Run candidates through the ODE solver to obtain samples and verifier scores.
- Update the pivot point with the best candidate and repeat.
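A minimal sketch of this loop, reusing the hypothetical `denoise` and `verifier` callables from the random-search example; `step_size` (the neighborhood radius) is an assumed tuning parameter.

```python
import torch

def zero_order_search(denoise, verifier, shape, num_iters,
                      num_neighbors, step_size, steps):
    pivot = torch.randn(shape)  # step 1: random Gaussian noise as the pivot
    best_sample = None
    for _ in range(num_iters):
        # Step 2: sample N candidates in the pivot's neighborhood
        # (keeping the pivot itself in the running).
        candidates = [pivot] + [pivot + step_size * torch.randn(shape)
                                for _ in range(num_neighbors)]
        # Step 3: run each candidate through the ODE solver and score it.
        scored = []
        for noise in candidates:
            sample = denoise(noise, steps=steps)
            scored.append((verifier(sample), noise, sample))
        # Step 4: the best-scoring candidate becomes the next pivot.
        _, pivot, best_sample = max(scored, key=lambda t: t[0])
    return best_sample
```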
Path search (a sketch follows the list):
- Sample N initial noise samples and run the ODE solver to a noise level σ.
- Add noise to each sample, simulating a forward noising process.
- Run the ODE solver on each noisy sample, keep the top N candidates, and repeat until σ = 0.
- Randomly search the remaining N samples and keep the best one.
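A hedged sketch of path search under simplifying assumptions. The helpers are hypothetical: `ode_to(x, sigma)` integrates the sampling ODE down to noise level sigma, `add_noise(x)` injects a small amount of forward noise, and `sigmas` is a decreasing schedule ending at 0.

```python
import torch

def path_search(ode_to, add_noise, verifier, shape,
                n_keep, n_expand, sigmas):
    # Step 1: N initial noises, partially denoised to the first sigma level.
    samples = [ode_to(torch.randn(shape), sigmas[0]) for _ in range(n_keep)]
    for sigma in sigmas[1:]:  # schedule decreases and ends at 0
        expanded = []
        for x in samples:
            for _ in range(n_expand):
                noisy = add_noise(x)                   # step 2: forward noising
                expanded.append(ode_to(noisy, sigma))  # step 3: denoise further
        # Keep only the top-N candidates according to the verifier.
        expanded.sort(key=verifier, reverse=True)
        samples = expanded[:n_keep]
    # Step 4: the survivors are clean (sigma = 0); keep the single best.
    return max(samples, key=verifier)
```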
Both zero-order and path search maintain strong locality relative to random search, which helps mitigate verifier hacking.
Scaling in Text-to-Image Scenarios
The team examined the scaling capabilities of the search framework in larger-scale text-to-image tasks, using DrawBench and T2I-CompBench datasets with the FLUX.1-dev model as the backbone. They expanded the selection of supervised verifiers, including Aesthetic Score Predictor, CLIPScore, and ImageReward, and created a Verifier Ensemble by combining these three.
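Since Aesthetic Score, CLIPScore, and ImageReward produce scores on different numeric scales, one simple way to build such an ensemble is to average each candidate's rank across verifiers; whether this matches the paper's exact combination rule is an assumption of this sketch.

```python
def ensemble_best(candidates, verifiers):
    """Pick the candidate with the best (lowest) mean rank across verifiers.
    Ranking side-steps the verifiers' incompatible score scales."""
    n = len(candidates)
    mean_rank = [0.0] * n
    for verifier in verifiers:
        scores = [verifier(c) for c in candidates]
        # Rank 0 goes to the highest-scoring candidate under this verifier.
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order):
            mean_rank[idx] += rank / len(verifiers)
    return candidates[min(range(n), key=lambda i: mean_rank[i])]
```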
Analysis: Verifier-Task Alignment
The study compared various verifier-algorithm combinations on different datasets. On DrawBench, using all verifiers generally improved sample quality. However, using the Aesthetic or CLIP verifier in isolation could lead to overfitting to its biases, a consequence of the mismatch in their evaluation focus: Aesthetic Score targets visual quality, while CLIP prioritizes image-text alignment. The effectiveness of a verifier therefore depends on how well it aligns with the task requirements.
Algorithm Performance and Compatibility
All three search algorithms (Random, Zero-Order, and Path) improved sampling quality on DrawBench. However, Random Search outperformed the other two in some respects because of their local nature: unconstrained random search converges quickly toward the verifier's bias, whereas zero-order and path search must iteratively improve on less-than-optimal candidates.
The team investigated the compatibility of their search method with fine-tuned models. Using a DPO-fine-tuned Stable Diffusion XL model, they found that the search method could be generalized to different models and improve the performance of already aligned models.
Effects of Different Dimensions of Inference Computation
The study explored how different aspects of inference computation affect results (a budget sketch follows the list):
- Number of search iterations: Increasing iterations brings the noise closer to the optimum.
- Computation per search iteration: Adjusting the number of denoising steps per iteration reveals different computationally optimal regions.
- Final generation computation: The team used optimal settings for the final denoising steps to ensure the highest final sample quality.
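As a back-of-envelope illustration of how these dimensions share one budget, the arithmetic below tallies total NFEs for an assumed configuration; the specific numbers are illustrative, not taken from the paper.

```python
def total_nfe(search_iters, candidates_per_iter, steps_per_search, final_steps):
    # Search cost: every candidate in every iteration pays its denoising steps.
    search_cost = search_iters * candidates_per_iter * steps_per_search
    # Final generation: one full run with the winning noise.
    return search_cost + final_steps

# E.g., 4 search iterations x 8 candidates x 30 cheap steps, then a
# 250-step final generation: 4 * 8 * 30 + 250 = 1210 NFEs in total.
print(total_nfe(4, 8, 30, 250))  # -> 1210
```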
Effectiveness of Investment in Computation
The researchers explored the effectiveness of inference-time scaling on smaller diffusion models. They found that, for ImageNet, scaling smaller models can be very efficient. In certain cases, searching on a smaller model can outperform larger models without search. However, the effectiveness depends on the baseline performance of the smaller model.
In text-based settings, PixArt-Σ with search outperformed FLUX.1-dev while using only a fraction of its computation. These results show that inference-time search lets a smaller model offset the much larger computational resources spent training a bigger one, yielding higher-quality samples more efficiently.