Diffusion models have shown remarkable results in image generation. However, the iterative nature of the diffusion process and the reliance on classifier-free guidance, which requires two model evaluations per denoising step, make inference slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen.
We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, and requires only about 1% of the base model's parameters to be trainable. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without additional training: this "plug-and-play" functionality drastically reduces inference cost while maintaining the visual fidelity of generated images. Empirically, we show that our approach produces visually appealing results and achieves an FID score comparable to the teacher's with as few as 8 to 16 steps.
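For illustration, here is a minimal sketch of the distillation objective, assuming a diffusers-style Stable Diffusion UNet. `GuideModel`, its architecture, and the output-level injection are hypothetical simplifications for brevity (the method in the paper injects into intermediate feature maps of the frozen base model); the training target is the standard classifier-free guidance output of the frozen teacher.

```python
# Illustrative sketch only: this guide model adds a correction at the
# noise-prediction output, whereas the paper injects into intermediate
# feature maps of the frozen UNet.
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

class GuideModel(nn.Module):
    """Hypothetical lightweight guide network: predicts a correction from
    the noisy latent, the timestep, and the guidance scale."""
    def __init__(self, latent_channels=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels + 2, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, 3, padding=1),
        )

    def forward(self, x_t, t, gscale):
        b, _, h, w = x_t.shape
        t_map = t.float().view(b, 1, 1, 1).expand(b, 1, h, w) / 1000.0
        g_map = gscale.view(b, 1, 1, 1).expand(b, 1, h, w)
        return self.net(torch.cat([x_t, t_map, g_map], dim=1))

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
unet.requires_grad_(False)   # the base text-to-image model stays frozen
guide = GuideModel()         # only the lightweight guide model is trained
opt = torch.optim.AdamW(guide.parameters(), lr=1e-4)

def distill_step(x_t, t, cond_emb, uncond_emb, gscale):
    # Teacher: classifier-free guidance, i.e. two passes of the frozen UNet.
    with torch.no_grad():
        eps_c = unet(x_t, t, encoder_hidden_states=cond_emb).sample
        eps_u = unet(x_t, t, encoder_hidden_states=uncond_emb).sample
        teacher = eps_u + gscale.view(-1, 1, 1, 1) * (eps_c - eps_u)
    # Student: a single conditional pass plus the guide-model injection,
    # which is what roughly halves the per-step inference cost.
    student = eps_c + guide(x_t, t, gscale)
    loss = nn.functional.mse_loss(student, teacher)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

At sampling time, each step then requires one pass of the frozen UNet plus the small guide model, rather than the two UNet passes of standard classifier-free guidance.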
As a by-product of our methodology, once the guide model is trained, it can be plugged into different fine-tuned base models without retraining.
In the early stages of the denoising process, the guide model injects with greater strength (larger absolute values), while in later stages the injections focus mainly on high-frequency details at lower strength. Similarly, lower guidance scales lead to weaker feature-map injections, and higher guidance scales lead to stronger ones.
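This trend can be inspected by logging the magnitude of the injections during sampling; a hedged sketch, reusing the hypothetical `GuideModel` from the block above:

```python
# Hedged sketch: track mean |injection| per step and guidance scale to see
# the trend described above (stronger early in sampling and for larger
# guidance scales; weaker and more detail-focused later).
import torch

@torch.no_grad()
def injection_strength(guide, x_t, t, gscale):
    return guide(x_t, t, gscale).abs().mean().item()

# Example usage over a recorded sampling trajectory of (x_t, t) pairs:
# strengths = {g: [injection_strength(guide, x_t, t, torch.full((1,), g))
#                  for x_t, t in trajectory] for g in (3.0, 7.5)}
```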
@inproceedings{hsiao2024plug,
title={Plug-and-Play Diffusion Distillation},
author={Hsiao, Yi-Ting and Khodadadeh, Siavash and Duarte, Kevin and Lin, Wei-An and Qu, Hui and Kwon, Mingi and Kalarot, Ratheesh},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13743--13752},
year={2024}
}