A Survey on Sentence Embedding Models Performance for Patent Analysis

Title

A Survey on Sentence Embedding Models Performance for Patent Analysis

Authors

Hamid Bekamiri, Daniel S. Hain, Roman Jurowetzki

Abstract

Patent data is an important source of knowledge for innovation research, and the technological similarity between pairs of patents is a key enabling indicator for patent analysis. Recently, researchers have been using patent vector space models based on different NLP embedding models to calculate the technological similarity between pairs of patents, in support of innovation studies, patent landscaping, technology mapping, and patent quality evaluation. More often than not, text embedding is a vital precursor to patent analysis tasks. A pertinent question then arises: how should we measure and evaluate the accuracy of these embeddings? To the best of our knowledge, there is no comprehensive survey that clearly delineates the performance of embedding models for calculating patent similarity indicators. Therefore, in this study, we provide an overview of the accuracy of these algorithms based on patent classification performance, and we propose a standard library and dataset for assessing the accuracy of embedding models based on the PatentSBERTa approach. In a detailed discussion, we report the performance of the top 3 algorithms at the section, class, and subclass levels. The results, based on the first claim of each patent, show that PatentSBERTa, Bert-for-patents, and TF-IDF Weighted Word Embeddings have the best accuracy for computing sentence embeddings at the subclass level. These initial results also show that model performance varies across classes, so researchers in patent analysis can use them to choose the most suitable model for the specific section of patent data they work with.
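As a rough illustration of the pairwise similarity computation the abstract refers to, the sketch below builds TF-IDF weighted word embeddings for three tokenized claims and compares them by cosine similarity. It is not the authors' code: the two-dimensional word vectors, the example claims, the smoothed IDF formula, and the helper names `idf_weights`, `embed`, and `cosine` are all assumptions made for this example.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency per token (one common variant)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def embed(doc, word_vecs, idf):
    """TF-IDF weighted average of the word vectors of the tokens in `doc`."""
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, tf in Counter(doc).items():
        if tok not in word_vecs:
            continue  # skip out-of-vocabulary tokens
        weight = tf * idf.get(tok, 1.0)
        for i, x in enumerate(word_vecs[tok]):
            total[i] += weight * x
        weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical 2-d word vectors; a real setup would load pretrained embeddings.
word_vecs = {
    "a": [0.5, 0.5],
    "battery": [1.0, 0.0],
    "electrode": [0.9, 0.1],
    "antibody": [0.0, 1.0],
    "protein": [0.1, 0.9],
}

# Hypothetical tokenized first claims of three patents.
claims = [
    ["a", "battery", "electrode"],
    ["a", "electrode", "battery", "battery"],
    ["a", "antibody", "protein"],
]

idf = idf_weights(claims)
vectors = [embed(claim, word_vecs, idf) for claim in claims]

sim_related = cosine(vectors[0], vectors[1])    # two battery claims
sim_unrelated = cosine(vectors[0], vectors[2])  # battery vs. antibody claim
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

With these toy inputs, the two battery-related claims score a higher similarity than the battery and antibody pair. Swapping `embed` for a sentence-embedding model would leave the cosine comparison unchanged.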
