Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

Title

Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

Author

Zbinden, Robin

Abstract

Taking advantage of recent advances in deep learning, text-to-image generative models have attracted considerable public attention. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description. Built on a recent approach to image generation called diffusion models, text-to-image models can produce many different types of high-resolution images, where human imagination is the only limit. However, these models require exceptionally large amounts of computational resources to train and rely on huge datasets collected from the internet. In addition, neither the codebase nor the models have been released. This prevents the AI community from experimenting with these cutting-edge models and makes reproducing their results complicated, if not impossible. In this thesis, we aim to contribute by first reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model. Our implementation is closely based on DALL-E 2, with several slight modifications to reduce the high computational cost. This gives us the opportunity to run experiments to understand what these models are capable of, especially in a low-resource regime. In particular, we provide additional and deeper analyses than those performed by the authors of DALL-E 2, including ablation studies. Furthermore, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method that can be used in conjunction with other guidance methods to improve image quality. Finally, the images generated by our model are of reasonably good quality, without incurring the significant training costs of state-of-the-art text-to-image models.
