Paper title
Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers
Paper authors
Paper abstract
Text classification is a natural language processing (NLP) task relevant to many commercial applications, such as e-commerce and customer service. Classifying such excerpts accurately often represents a challenge, due to intrinsic language aspects like irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding is a key NLP field nowadays, having undergone significant advances in the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, the literature coverage on generating embeddings for Brazilian Portuguese texts is scarce, especially when considering commercial user reviews. Therefore, this work aims to provide a comprehensive experimental study of embedding approaches targeting binary sentiment classification of user reviews in Brazilian Portuguese. The study ranges from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP models. The methods are evaluated on five open-source databases with pre-defined data partitions, made available in an open digital repository to encourage reproducibility. The fine-tuned TLMs achieved the best results in all cases, followed by the feature-based TLM, LSTM, and CNN, whose ranks alternated depending on the database under analysis.
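As a concrete illustration of the classical end of the spectrum the abstract describes, a Bag-of-Words embedding maps each review to a vector of token counts over a fixed vocabulary. The sketch below is a minimal plain-Python version; the toy Portuguese reviews and vocabulary are illustrative and not drawn from the paper's databases.

```python
# Minimal Bag-of-Words embedding sketch (illustrative, not the paper's pipeline).
from collections import Counter


def tokenize(text: str) -> list:
    """Naive whitespace tokenizer with lowercasing."""
    return text.lower().split()


def build_vocab(corpus: list) -> dict:
    """Assign each unique token an index in order of first appearance."""
    vocab = {}
    for doc in corpus:
        for tok in tokenize(doc):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def bow_embed(text: str, vocab: dict) -> list:
    """Embed a document as a vector of token counts over the vocabulary."""
    counts = Counter(tokenize(text))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


# Toy positive/negative Brazilian Portuguese reviews (hypothetical examples).
corpus = ["produto excelente recomendo", "produto ruim nao recomendo"]
vocab = build_vocab(corpus)
vectors = [bow_embed(doc, vocab) for doc in corpus]
# vocab order: produto, excelente, recomendo, ruim, nao
# vectors -> [[1, 1, 1, 0, 0], [1, 0, 1, 1, 1]]
```

Such count vectors would then feed a conventional classifier for the binary sentiment task, in contrast to the dense contextual embeddings produced by the fine-tuned TLMs that the study found strongest.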