Diffusion Model Alignment with Direct Preference Optimization
Packard 101
Abstract: Large language models (LLMs) are fine-tuned on human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to better align them with users’ preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF that directly optimizes a policy to best satisfy human preferences under a classification objective. We re-formulate DPO to account for a diffusion-model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. We show that Diffusion-DPO significantly improves the visual appeal and prompt alignment of state-of-the-art image generation models, including Stable Diffusion XL and Stable Diffusion 3. We also develop a variant that uses AI feedback and performs comparably to training on human preferences, opening the door to scaling alignment methods.
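For attendees curious about the objective before the talk: the abstract describes substituting the diffusion evidence lower bound for the intractable image likelihood inside the DPO loss, which yields a per-timestep denoising comparison between a preferred and a dispreferred image. The PyTorch-style sketch below is only an illustration under that reading, not the authors' implementation; the function name diffusion_dpo_loss, the epsilon-prediction networks eps_theta and eps_ref, and the folding of the paper's timestep weighting into a single large beta constant are all assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta, eps_ref, x_w, x_l, cond, alphas_cumprod, beta=5000.0):
    """Illustrative sketch of a Diffusion-DPO training step (not the authors' code).

    eps_theta / eps_ref: trainable policy and frozen reference noise-prediction networks.
    x_w / x_l: clean latents of the preferred / dispreferred image for the same prompt `cond`.
    alphas_cumprod: cumulative noise-schedule products, one entry per diffusion timestep.
    beta: DPO temperature; the timestep weighting is assumed folded into this constant.
    """
    b = x_w.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x_w.device)
    noise = torch.randn_like(x_w)

    # Forward-diffuse both images with the same timestep and the same noise sample.
    a_bar = alphas_cumprod.to(x_w.device)[t].view(b, 1, 1, 1)
    x_w_t = a_bar.sqrt() * x_w + (1 - a_bar).sqrt() * noise
    x_l_t = a_bar.sqrt() * x_l + (1 - a_bar).sqrt() * noise

    def err(net, x_t):
        # Per-sample denoising (epsilon-prediction) error.
        return ((net(x_t, t, cond) - noise) ** 2).mean(dim=(1, 2, 3))

    # Reference-adjusted errors: how much better (or worse) the policy denoises
    # each image than the frozen reference model does.
    diff_w = err(eps_theta, x_w_t) - err(eps_ref, x_w_t)
    diff_l = err(eps_theta, x_l_t) - err(eps_ref, x_l_t)

    # DPO-style logistic loss: reward a lower relative error on the preferred image
    # than on the dispreferred one.
    return -F.logsigmoid(-beta * (diff_w - diff_l)).mean()
```

In words, the policy is pushed to denoise the preferred image better, relative to the frozen reference, than it denoises the dispreferred one, which is how the classification-style DPO objective carries over to diffusion models.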
Bio: Nikhil Naik is an AI Research Scientist on the Meta Llama team. His research interests are in computer vision, multimodality and foundation models. He obtained his PhD from MIT in 2016 and has worked at Google, Microsoft Research, and Salesforce Research in the past. Please see his website for more details on his work: mit.edu/naik