Better Aligning Text-to-Image Models with Human Preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, Hongsheng Li

1 Multimedia Laboratory, The Chinese University of Hong Kong   2 SenseTime Research
3 Qing Yuan Research Institute, Shanghai Jiao Tong University
4 Centre for Perceptual and Interactive Intelligence (CPII)

Abstract


TL;DR: Stable Diffusion can be improved by learning from human preferences. The trained model is better aligned
with user intentions and produces images with fewer artifacts, such as weird limbs and faces.


Recent years have witnessed rapid growth in deep generative models, with text-to-image models gaining significant public attention. However, existing models often generate images that do not align well with human aesthetic preferences, producing, for example, awkward combinations of limbs and facial expressions. To address this issue, we collect a dataset of human choices on generated images from the Stable Foundation Discord channel. Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices. We therefore train a human preference classifier on the collected dataset and derive a Human Preference Score (HPS) from it. Using HPS, we propose a simple yet effective method to adapt Stable Diffusion to better align with human aesthetic preferences. Our experiments show that HPS outperforms CLIP in predicting human choices and generalizes well to images generated by other models. By tuning Stable Diffusion under the guidance of HPS, the adapted model generates images that human users prefer.
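
To make the scoring step concrete, here is a minimal sketch of how a CLIP-style preference classifier and the derived score could look, written against the Hugging Face `transformers` API. The base checkpoint, the `preference_loss` helper, and the exact score scaling are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (not the official implementation): fine-tuning a
# CLIP-style model on human choices and deriving a preference score from it.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Assumed base checkpoint; the paper's actual backbone may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def preference_loss(prompt, images, chosen_idx):
    """Cross-entropy over the candidate images of one prompt: the image the
    user picked should receive the highest image-text similarity."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    out = model(**inputs)
    # image_embeds: (n, d); text_embeds: (1, d); both L2-normalized by CLIPModel.
    sims = out.image_embeds @ out.text_embeds.T          # (n, 1) cosine similarities
    logits = sims.squeeze(-1).unsqueeze(0) * model.logit_scale.exp()
    return F.cross_entropy(logits, torch.tensor([chosen_idx]))

@torch.no_grad()
def hps(prompt, image):
    """Preference score: scaled image-text similarity of the tuned model."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt", padding=True)
    out = model(**inputs)
    return 100.0 * (out.image_embeds @ out.text_embeds.T).item()
```

In training, `preference_loss` would be minimized over the collected choice dataset with a standard optimizer; after tuning, `hps` can rank images generated for the same prompt.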

Pipeline Overview


Figure: pipeline overview.

Left: We first train a human preference classifier to predict human choices from a prompt and the images generated for it, then derive the Human Preference Score (HPS) from the trained classifier. HPS complements existing image quality metrics by incorporating human aesthetic preferences. Right: Adapting Stable Diffusion to generate preferable images. During training, Stable Diffusion is tuned to associate the concept of non-preferred images with a prompt prefix [Identifier]. During inference, [Identifier] is used as the negative prompt in classifier-free guidance, steering generation away from non-preferred results.
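
The inference side is straightforward to reproduce with an off-the-shelf pipeline. Below is a hedged sketch using the `diffusers` library; the checkpoint path is a placeholder, and the literal string "[Identifier]" stands for whatever prefix the adapted model was tuned with.

```python
# Hedged sketch: sampling from an adapted checkpoint, with the learned prefix
# passed as the negative prompt for classifier-free guidance.
import torch
from diffusers import StableDiffusionPipeline

# "path/to/adapted-stable-diffusion" is a placeholder, not a released checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/adapted-stable-diffusion", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a portrait of an astronaut riding a horse",
    negative_prompt="[Identifier]",  # the prefix associated with non-preferred images
    guidance_scale=7.5,              # classifier-free guidance strength
).images[0]
image.save("sample.png")
```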

Results


The model trained with human preferences generates images that better align with users' intentions and contain fewer artifacts. More visualizations can be found in the supplementary materials.


Figure: results on alignment with user intention.

Figure: results on image topology (fewer artifacts such as weird limbs and faces).


BibTeX

@article{wu2023alsd,
  title={Better Aligning Text-to-Image Models with Human Preference},
  author={Xiaoshi Wu and Keqiang Sun and Feng Zhu and Rui Zhao and Hongsheng Li},
  journal={arXiv preprint arXiv:2303.14420},
  year={2023}
}