Paper title
Do DALL-E and Flamingo Understand Each Other?
Authors
Abstract
The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether Flamingo and DALL-E understand each other. To study this question, we propose a reconstruction task where Flamingo generates a description for a given image and DALL-E uses this description as input to synthesize a new image. We argue that these models understand each other if the generated image is similar to the given image. Specifically, we study the relationship between the quality of the image reconstruction and that of the text generation. We find that an optimal description of an image is one that gives rise to a generated image similar to the original one. The finding motivates us to propose a unified framework to finetune the text-to-image and image-to-text models. Concretely, the reconstruction part forms a regularization loss to guide the tuning of the models. Extensive experiments on multiple datasets with different image captioning and image generation models validate our findings and demonstrate the effectiveness of our proposed unified framework. As DALL-E and Flamingo are not publicly available, we use Stable Diffusion and BLIP in the remaining work. Project website: https://dalleflamingo.github.io.
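The reconstruction task in the abstract can be illustrated with a minimal sketch: a caption is turned back into an image, and the caption is scored by how similar the reconstruction is to the original. This is not the authors' implementation. The names `reconstruction_score`, `best_caption`, `generate_image`, and `embed_image` are hypothetical, the use of cosine similarity over image embeddings is an assumption, and the toy stand-ins below replace the real text-to-image model (Stable Diffusion in the paper) and any real image encoder.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_score(
    image,
    caption: str,
    generate_image: Callable[[str], object],  # hypothetical text-to-image model
    embed_image: Callable[[object], Vector],  # hypothetical image encoder
) -> float:
    """Score a caption by how closely its generated image matches the original."""
    reconstructed = generate_image(caption)
    return cosine_similarity(embed_image(image), embed_image(reconstructed))


def best_caption(
    image,
    candidates: Sequence[str],
    generate_image: Callable[[str], object],
    embed_image: Callable[[object], Vector],
) -> str:
    """Pick the candidate caption whose reconstruction is most similar to the image."""
    return max(
        candidates,
        key=lambda c: reconstruction_score(image, c, generate_image, embed_image),
    )


# Toy stand-ins (assumptions for illustration only): "images" are already
# 3-dimensional vectors, and the "generator" is a lookup table.
TOY_IMAGES = {
    "a red square": (1.0, 0.0, 0.0),
    "a blue circle": (0.0, 0.0, 1.0),
}


def toy_generate(caption: str) -> Vector:
    return TOY_IMAGES[caption]


def toy_embed(image: Vector) -> Vector:
    return image


if __name__ == "__main__":
    original = (0.9, 0.1, 0.0)  # a mostly "red" toy image
    print(best_caption(original, list(TOY_IMAGES), toy_generate, toy_embed))
```

In a real setting the same score could, under the abstract's description, serve as the basis of a regularization term during finetuning; how the paper actually defines that loss is not specified here.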