Title

Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Authors

Maksud Sharipov, Ogabek Sobirov

Abstract

Lemmatization is one of the core concepts in natural language processing, so building a lemmatization tool is an important task. This paper discusses the construction of a lemmatization algorithm for the Uzbek language. The main purpose of the work is to remove the affixes of Uzbek words by means of a finite state machine and to identify the lemma of each word (the form that can be found in a dictionary). Affix removal relies on a database of affixes and on part-of-speech knowledge. The lemmatization procedure comprises general rules and part-of-speech data for Uzbek, the affixes themselves, a classification of those affixes, affix removal driven by a finite state machine for each class, and finally the identification of the word's lemma.
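
The class-by-class affix removal described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the affix lists are a small hand-picked subset of Uzbek noun suffixes rather than the paper's affix database, the state order (case, then possessive, then plural, scanning from the end of the word) is an assumed simplification, and `AFFIX_CLASSES`, `MIN_STEM_LEN`, and `strip_noun_affixes` are hypothetical names.

```python
# Illustrative affix classes for nouns, listed in the order they are
# peeled off the END of the word. A tiny hand-picked subset, NOT the
# affix database used in the paper.
AFFIX_CLASSES = [
    ("case", ["ning", "dan", "ni", "ga", "da"]),
    ("possessive", ["imiz", "ingiz", "im", "ing"]),
    ("plural", ["lar"]),
]

# Assumed guard against over-stripping very short stems.
MIN_STEM_LEN = 3


def strip_noun_affixes(word):
    """Walk the states in order, removing at most one suffix per class.

    Returns the remaining stem and the list of removed suffixes.
    """
    stem = word.lower()
    removed = []
    state = 0
    while state < len(AFFIX_CLASSES):
        _, suffixes = AFFIX_CLASSES[state]
        # Try longer suffixes first so "ning" wins over "ni", etc.
        for suffix in sorted(suffixes, key=len, reverse=True):
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
                stem = stem[: -len(suffix)]
                removed.append(suffix)
                break
        # Move to the next state whether or not a suffix was removed.
        state += 1
    return stem, removed


if __name__ == "__main__":
    for w in ["kitoblarimizdan", "maktablarda", "kitob"]:
        print(w, "->", strip_noun_affixes(w))
```

With these illustrative lists, `kitoblarimizdan` ("from our books") is reduced to `kitob` by removing `dan`, `imiz`, and `lar` in turn. A pure suffix stripper like this can still over-strip words whose stem happens to end in an affix-like string, which is presumably why the approach in the abstract also draws on part-of-speech knowledge; a dictionary check on the candidate stem would be one way to guard against that.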
