Paper title
MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts
Paper authors
Paper abstract
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a text of unknown authorship was written by a candidate author or not. We also make available the source code of the authorship verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other researchers. We also describe the application of the above authorship verification system, using these datasets as training data, for investigating the authorship of two medieval epistles whose authorship has been disputed by scholars.
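The authorship verification task described in the abstract can be illustrated with a minimal sketch. It is not the system released with the datasets: the character-trigram features, the cosine-similarity score, the 0.5 threshold, and the names `char_ngrams`, `cosine`, and `verify` are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalised text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(known_texts, unknown_text, threshold=0.5):
    """Binary decision: was unknown_text written by the candidate author?

    known_texts are texts labelled with the candidate author; the threshold
    is a hypothetical value that would normally be tuned on held-out data.
    """
    profile = Counter()
    for text in known_texts:
        profile.update(char_ngrams(text))
    return cosine(profile, char_ngrams(unknown_text)) >= threshold


# Toy usage with placeholder strings, not texts from either dataset.
candidate_texts = ["salutem dicit amico suo", "salutem dicit fratri suo"]
decision = verify(candidate_texts, "salutem dicit patri suo")
```

The decision is a yes/no answer for a single candidate author, which is what distinguishes verification from authorship attribution, where one author is chosen from a closed set of candidates.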