Transformers in Diffusion Models for Image Generation and Beyond
Gates B01
Presentation Abstract: Diffusion models have recently become the dominant approach for generating realistic synthetic continuous media such as images and video. This talk covers how Transformers are used in diffusion models for image generation, and goes well beyond that.
We set the context by briefly discussing some preliminaries around diffusion models and how they are trained. We then cover the UNet-based network architecture that used to be the de facto choice for diffusion models. This helps motivate the introduction and rise of transformer-based architectures for diffusion.
We cover the fundamental building blocks and the degrees of freedom one can ablate in the base architecture under different conditioning settings. We then shift our focus to the different flavors of attention and other connected components that the community has been using in some of the state-of-the-art (SoTA) open models for various use cases. We conclude by shedding light on some promising future directions around efficiency.
Speaker Bio: Sayak works on diffusion models at Hugging Face. His day-to-day includes contributing to the diffusers library, training and babysitting diffusion models, and working on applied ideas. He is interested in subject-driven generation, preference alignment, and evaluation of diffusion models. When he is not working, he can be found playing the guitar and binge-watching ICML tutorials and Suits.
Talk Logistics: We will have an opportunity for questions after the talk. You can submit any questions for the speaker on sli.do, using the code #cs25.