Paper Title

JEMMA: An Extensible Java Dataset for ML4Code Applications

Authors

Anjan Karmakar, Miltiadis Allamanis, Romain Robbes

Abstract

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible, allowing users to add new properties and representations to the dataset and to evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
