Paper Title
Automated Audio Captioning via Fusion of Low- and High-Dimensional Features
Paper Authors
Paper Abstract
Automated audio captioning (AAC) aims to describe the content of an audio clip using simple sentences. Existing AAC methods are built on an encoder-decoder architecture, whose success is attributed to the use of a pre-trained CNN10, known as PANNs, as the encoder to learn rich audio representations. AAC remains a highly challenging task because its high-dimensional latent space involves audio from a wide variety of scenarios. Existing methods use only the high-dimensional representation of PANNs as the input to the decoder. However, a low-dimensional representation may retain as much audio information as the high-dimensional one, yet it is typically neglected. In addition, a high-dimensional approach that predicts audio captions by learning only from existing captions lacks robustness and efficiency. To address these challenges, this paper proposes a new encoder-decoder framework, called the Low- and High-Dimensional Feature Fusion (LHDFF) model, which integrates low- and high-dimensional features for AAC. Within LHDFF, a new PANNs encoder, called Residual PANNs (RPANNs), is proposed; it fuses the low-dimensional feature from an intermediate convolutional layer with the high-dimensional feature from the final layer of PANNs. To fully exploit the fused low- and high-dimensional feature and the high-dimensional feature, respectively, a dual transformer decoder structure is used to generate captions in parallel. In particular, a probabilistic fusion approach is proposed to improve the overall performance of the system by exploiting the respective strengths of the two transformer decoders. Experimental results show that LHDFF achieves the best performance on the Clotho and AudioCaps datasets compared with other existing models.
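To make the encoder-side fusion concrete, below is a minimal PyTorch sketch of a CNN10-like backbone in which an intermediate (low-dimensional) feature map is residually fused into the final (high-dimensional) one, as the abstract describes. The module names, channel widths, and the 1x1 projection are illustrative assumptions, not the paper's exact RPANNs architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Simplified CNN10-style convolution block (illustrative, not the exact PANNs code)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(2),
        )

    def forward(self, x):
        return self.net(x)

class RPANNsSketch(nn.Module):
    """Hypothetical residual fusion of a low-dimensional intermediate feature
    with the high-dimensional final feature, following the abstract's description."""
    def __init__(self):
        super().__init__()
        self.block1 = ConvBlock(1, 64)
        self.block2 = ConvBlock(64, 128)    # intermediate, low-dimensional feature
        self.block3 = ConvBlock(128, 256)
        self.block4 = ConvBlock(256, 512)   # final, high-dimensional feature
        # Project the intermediate feature to the final channel width before fusing.
        self.proj = nn.Conv2d(128, 512, kernel_size=1)

    def forward(self, mel):                  # mel: (batch, 1, time, freq)
        low = self.block2(self.block1(mel))  # low-dimensional feature map
        high = self.block4(self.block3(low)) # high-dimensional feature map
        # Match spatial size, then fuse residually.
        low_proj = F.adaptive_avg_pool2d(self.proj(low), high.shape[-2:])
        fused = high + low_proj
        # The fused stream and the plain high-dimensional stream feed the two decoders.
        return fused, high
```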
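The probabilistic fusion of the two decoders can likewise be sketched. The version below assumes a simple convex combination of the two decoders' per-token output distributions with a hypothetical mixing weight `weight`; the paper's actual fusion scheme may differ.

```python
import torch

def probabilistic_fusion(logits_fused, logits_high, weight=0.5):
    """Fuse the token distributions of the two transformer decoders.

    logits_fused: logits from the decoder fed with the fused low/high feature
    logits_high:  logits from the decoder fed with the high-dimensional feature
    weight:       hypothetical mixing coefficient in [0, 1]
    """
    p_fused = torch.softmax(logits_fused, dim=-1)
    p_high = torch.softmax(logits_high, dim=-1)
    # Convex combination over the vocabulary; the next token is drawn from this mixture.
    return weight * p_fused + (1.0 - weight) * p_high
```

A convex combination of this kind lets the system lean on whichever decoder is more confident for a given token, which is one plausible reading of the abstract's claim that fusion concentrates on the respective strengths of the two decoders.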