Paper Title
Recipe Generation from Unsegmented Cooking Videos
Authors
Abstract
This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract the key events in preparing a dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims to detect events exhaustively and generate sentences for them. Unlike DVC, however, recipe generation depends on recipe story awareness: a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them. We analyze the output of a DVC model and confirm that, although (1) several of its events are adoptable as a recipe story, (2) the sentences generated for those events are not grounded in the visual content. Based on this, we set our goal to obtain correct recipes by selecting oracle events from the output events and re-generating sentences for them. To achieve this, we propose a transformer-based multimodal recurrent approach that trains an event selector and a sentence generator to select oracle events from the DVC model's events and generate sentences for them. In addition, we extend the model to take ingredients as input in order to generate more accurate recipes. The experimental results show that the proposed method outperforms state-of-the-art DVC models. We also confirm that, by modeling the recipe in a story-aware manner, the proposed model outputs an appropriate number of events in the correct order.
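The abstract does not say how oracle events are identified among the DVC model's candidate events. As a minimal sketch, assuming an oracle event is the candidate with the highest temporal IoU against each ground-truth recipe step, the selection could look like the following. The names `temporal_iou` and `select_oracle_events`, the `(start, end)` segment format in seconds, and the IoU criterion itself are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec); this format is an assumption for illustration.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_oracle_events(
    candidates: List[Segment], references: List[Segment]
) -> List[int]:
    """For each reference step, in recipe order, return the index of the
    candidate event with the highest temporal IoU (hypothetical criterion)."""
    if not candidates:
        return []
    selected = []
    for ref in references:
        best = max(
            range(len(candidates)),
            key=lambda i: temporal_iou(candidates[i], ref),
        )
        selected.append(best)
    return selected


# Candidate events as a DVC model might propose them (made-up values).
candidates = [(0.0, 10.0), (5.0, 30.0), (28.0, 60.0), (55.0, 90.0)]
# Ground-truth recipe steps in story order (made-up values).
references = [(0.0, 12.0), (30.0, 58.0)]
print(select_oracle_events(candidates, references))
```

Because the references are visited in recipe order, the selected indices form a short, ordered subset of the candidates. That is the kind of target an event selector could be trained to reproduce, instead of emitting every detected event.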