Plug-and-Play Diffusion Distillation

Yi-Ting Hsiao1, Siavash Khodadadeh2, Kevin Duarte2, Wei-An Lin2, Hui Qu2, Mingi Kwon3, Ratheesh Kalarot2
1University of Michigan, 2Adobe Research, 3Yonsei University
hsiaoyt@umich.edu, {khodadad, kduarte, wlin, hqu, kalarot}@adobe.com, kwonmingi@yonsei.ac.kr
[Teaser figure]

TL;DR: We use a lightweight guide model to replace classifier-free guidance (CFG) while keeping the base model untouched. Once trained, the guide model is "plug-and-play": it can be attached to different base models in a flash.

Abstract

Diffusion models have shown tremendous results in image generation. However, due to the iterative nature of the diffusion process and its reliance on classifier-free guidance, inference times are slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen.

We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, while requiring only 1% of the base model's parameters to be trained. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without additional training: this "plug-and-play" functionality drastically reduces inference cost while maintaining the visual fidelity of generated images. Empirically, our approach produces visually appealing results and achieves an FID score comparable to the teacher's with as few as 8 to 16 steps.
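To make the training recipe concrete, below is a minimal PyTorch sketch of the distillation idea: a frozen base denoiser provides the classifier-free-guided target (two forward passes), while a small trainable guide learns to reproduce that target from a single conditional pass. `DummyUNet`, `GuideModel`, and the tensor shapes are illustrative assumptions; the actual method injects the guide's feature maps into the frozen base network, whereas this sketch approximates the effect with an output-space correction for brevity.

```python
import torch
import torch.nn as nn

class DummyUNet(nn.Module):
    """Stand-in for the frozen text-to-image denoiser (illustrative only)."""
    def __init__(self, ch=4):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x, t, text_emb):
        return self.conv(x)  # a real UNet would condition on t and text_emb

class GuideModel(nn.Module):
    """Lightweight trainable guide (roughly 1% of the base model's size)."""
    def __init__(self, ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, ch, 3, padding=1),
        )

    def forward(self, eps_cond):
        return self.net(eps_cond)  # predicted CFG correction

base = DummyUNet().eval()
for p in base.parameters():
    p.requires_grad_(False)          # the base model stays frozen throughout

guide = GuideModel()
opt = torch.optim.AdamW(guide.parameters(), lr=1e-4)

w = 7.5                              # CFG guidance scale
x_t = torch.randn(2, 4, 64, 64)      # noisy latents
t = torch.randint(0, 1000, (2,))     # diffusion timesteps
cond = torch.randn(2, 77, 768)       # text embeddings
uncond = torch.zeros_like(cond)      # null-prompt embeddings

# Teacher: standard classifier-free guidance, two base-model passes.
with torch.no_grad():
    eps_c = base(x_t, t, cond)
    eps_u = base(x_t, t, uncond)
    teacher = eps_u + w * (eps_c - eps_u)

# Student: one base-model pass plus the cheap guide correction.
student = eps_c + guide(eps_c)
loss = nn.functional.mse_loss(student, teacher)

opt.zero_grad()
loss.backward()
opt.step()
```

At inference, the student side replaces the two CFG passes with a single conditional pass plus the guide, which is where the near-halving of computation comes from.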


Results

[Figure: qualitative results]

Plug-and-play to Different Base Models

As a by-product of our methodology, once the guide model is trained, it can be plugged into different fine-tuned base models without retraining, as sketched below.

[Figure: plug-and-play results with different fine-tuned base models]
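Because only the guide carries the distilled guidance, swapping the base model is trivial. Continuing the sketch above (reusing `DummyUNet`, the trained `guide`, and the same dummy inputs), a hypothetical fine-tuned checkpoint can be dropped in with no retraining:

```python
# Stand-in for a domain-specific fine-tune of the same architecture,
# e.g. an anime- or photography-specialized checkpoint (hypothetical).
finetuned = DummyUNet().eval()
for p in finetuned.parameters():
    p.requires_grad_(False)

# Same trained guide, different base model: no additional training.
with torch.no_grad():
    eps_c = finetuned(x_t, t, cond)
    guided_eps = eps_c + guide(eps_c)
```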

Feature Map Visualization from the Guide Model

In the early stages of the denoising process, the guide model's injections are stronger (larger absolute values); in the later stages, the injections focus mainly on high-frequency details with lower strength. Likewise, lower guidance scales yield weaker feature-map injections, while higher guidance scales yield stronger ones.

[Figures: feature map visualizations from the guide model across timesteps and guidance scales]
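As a rough illustration of how such a visualization could be produced, the probe below (reusing `base`, `guide`, and `cond` from the sketches above) records the mean absolute guide output at a few timesteps. The 8-step schedule and the fixed probe latent are simplifying assumptions; with a real model and sampler, the curve would trace the early-strong, late-weak trend described above.

```python
import matplotlib.pyplot as plt

strengths = []
z = torch.randn(1, 4, 64, 64)        # probe latent, held fixed for simplicity
                                     # (a real sampler would update it each step)
for tt in reversed(range(0, 1000, 125)):          # coarse 8-step timestep grid
    with torch.no_grad():
        eps_c = base(z, torch.tensor([tt]), cond[:1])
        inj = guide(eps_c)                        # guide's injected correction
    strengths.append(inj.abs().mean().item())     # mean |injection| per step

plt.plot(strengths, marker="o")
plt.xlabel("denoising step")
plt.ylabel("mean |guide injection|")
plt.title("Guide injection strength across sampling steps")
plt.show()
```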

BibTeX

@inproceedings{hsiao2024plug,
  title={Plug-and-Play Diffusion Distillation},
  author={Hsiao, Yi-Ting and Khodadadeh, Siavash and Duarte, Kevin and Lin, Wei-An and Qu, Hui and Kwon, Mingi and Kalarot, Ratheesh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13743--13752},
  year={2024}
}