Evolution of Scikit-Learn Pipelines with Dynamic Structured Grammatical Evolution

Paper Title

Evolution of Scikit-Learn Pipelines with Dynamic Structured Grammatical Evolution

Authors

Filipe Assunção, Nuno Lourenço, Bernardete Ribeiro, Penousal Machado

Abstract

The deployment of Machine Learning (ML) models is a difficult and time-consuming job comprising a series of sequential and correlated tasks, from data pre-processing and the design and extraction of features to the choice of the ML algorithm and its parameterisation. The task is even more challenging because the design of features is in many cases problem-specific and thus requires domain expertise. To overcome these limitations, Automated Machine Learning (AutoML) methods seek to automate the design of pipelines with little or no human intervention, i.e., to automate the selection of the sequence of methods that have to be applied to the raw data. These methods have the potential to enable non-expert users to use ML, and to provide expert users with solutions that they would be unlikely to consider. In particular, this paper describes AutoML-DSGE, a novel grammar-based framework that adapts Dynamic Structured Grammatical Evolution (DSGE) to the evolution of Scikit-Learn classification pipelines. The experimental results include a comparison of AutoML-DSGE with another grammar-based AutoML framework, Resilient Classification Pipeline Evolution (RECIPE), and show that the average performance of the classification pipelines generated by AutoML-DSGE is always superior to the average performance of RECIPE; the differences are statistically significant in 3 of the 10 datasets used.
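
To make the idea of a grammar-based pipeline search more concrete, the following is a minimal sketch of how a DSGE-style genotype could be decoded into a sequence of pipeline steps. It is an illustration under stated assumptions, not the framework's actual code: the toy grammar, the `decode` function, and the choice of step names are all hypothetical, and the grammar used by AutoML-DSGE is not given in the abstract. The sketch assumes the usual description of DSGE, in which the genotype holds one list of integers per non-terminal, each integer selects a production directly, and a list grows when the mapping needs more genes than it has.

```python
import random

# Hypothetical toy grammar (NOT the grammar used by AutoML-DSGE).
# Terminals are names of Scikit-Learn classes, kept as plain strings here.
GRAMMAR = {
    "<pipeline>": [["<preprocessing>", "<classifier>"], ["<classifier>"]],
    "<preprocessing>": [["StandardScaler"], ["MinMaxScaler"], ["PCA"]],
    "<classifier>": [["DecisionTreeClassifier"], ["GaussianNB"],
                     ["LogisticRegression"]],
}


def decode(genotype, rng, start="<pipeline>"):
    """Map a genotype to a list of terminal symbols (pipeline step names).

    genotype: dict with one list of integers per non-terminal; each integer
    indexes a production of that non-terminal. Lists are extended in place
    with random genes when the mapping runs out of them.
    """
    positions = {nt: 0 for nt in GRAMMAR}  # next unread gene per non-terminal

    def expand(symbol):
        if symbol not in GRAMMAR:  # terminal symbol: emit it as a step
            return [symbol]
        genes = genotype[symbol]
        if positions[symbol] == len(genes):  # grow the list on demand
            genes.append(rng.randrange(len(GRAMMAR[symbol])))
        choice = genes[positions[symbol]]
        positions[symbol] += 1
        steps = []
        for child in GRAMMAR[symbol][choice]:  # leftmost, depth-first
            steps.extend(expand(child))
        return steps

    return expand(start)


genotype = {"<pipeline>": [0], "<preprocessing>": [2], "<classifier>": [1]}
steps = decode(genotype, random.Random(0))
print(steps)  # ['PCA', 'GaussianNB']
```

In a full system the decoded names would be instantiated as Scikit-Learn estimators, fitted, and scored to obtain a fitness value; that part, along with mutation and crossover, is left out of this sketch.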
