Title
Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law
Authors
Abstract
NLP in the legal domain has seen increasing success with the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. PLMs trained over European and US legal text are publicly available; however, legal text from other domains (countries), such as India, has many distinguishing characteristics. With the rapidly increasing number of Legal NLP applications in various countries, it has become necessary to pre-train such PLMs over the legal text of other countries as well. In this work, we investigate pre-training in the Indian legal domain. We re-train (continue pre-training) two popular legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data, and also train a model from scratch with a vocabulary based on Indian legal text. We apply these PLMs to three benchmark legal NLP tasks -- Legal Statute Identification from facts, Semantic Segmentation of Court Judgment Documents, and Court Appeal Judgment Prediction -- over both Indian and non-Indian (EU, UK) datasets. We observe that our approach not only enhances performance on the new domain (Indian texts) but also on the original domain (European and UK texts). We also conduct explainability experiments for a qualitative comparison of all these PLMs.
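
To make the two pre-training strategies in the abstract concrete, here is a minimal sketch using the HuggingFace transformers, datasets, and tokenizers libraries: Option A continues masked-language-model pre-training from an existing legal checkpoint, and Option B builds a fresh WordPiece vocabulary from Indian legal text for training from scratch. The checkpoint name nlpaueb/legal-bert-base-uncased is LegalBERT's public release; the corpus file indian_legal_corpus.txt and all hyperparameters are hypothetical placeholders, not the authors' actual configuration.

# Sketch of the two pre-training strategies described in the abstract.
# Assumes a plain-text corpus (one document per line); path is hypothetical.
from datasets import load_dataset
from tokenizers import BertWordPieceTokenizer
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

CORPUS = "indian_legal_corpus.txt"  # hypothetical corpus of Indian legal text

# --- Option A: continue pre-training an existing legal PLM (e.g., LegalBERT) ---
model_name = "nlpaueb/legal-bert-base-uncased"  # public LegalBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": CORPUS})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard masked-language-modelling objective (15% of tokens masked)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legalbert-continued", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()

# --- Option B: build a new WordPiece vocabulary from Indian legal text, ---
# --- to be used when training a model from scratch instead of continuing ---
wp = BertWordPieceTokenizer(lowercase=True)
wp.train(files=[CORPUS], vocab_size=30522)  # BERT-base vocabulary size
wp.save_model("indian-legal-vocab")          # writes vocab.txt to this directory

The resulting vocabulary from Option B would then back a randomly initialized BERT-style model pre-trained on the same corpus; the design trade-off is that a domain-specific vocabulary tokenizes Indian legal terminology into fewer, more meaningful subwords, at the cost of losing the knowledge already stored in an existing checkpoint.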