Paper Title
Joints in Random Forests
Paper Authors
Paper Abstract
Decision Trees (DTs) and Random Forests (RFs) are powerful discriminative learners and tools of central importance to the everyday machine learning practitioner and data scientist. Due to their discriminative nature, however, they lack principled methods to process inputs with missing features or to detect outliers, which requires pairing them with imputation techniques or a separate generative model. In this paper, we demonstrate that DTs and RFs can naturally be interpreted as generative models, by drawing a connection to Probabilistic Circuits, a prominent class of tractable probabilistic models. This reinterpretation equips them with a full joint distribution over the feature space and leads to Generative Decision Trees (GeDTs) and Generative Forests (GeFs), a family of novel hybrid generative-discriminative models. This family of models retains the overall characteristics of DTs and RFs while additionally being able to handle missing features by means of marginalisation. Under certain assumptions, frequently made for Bayes consistency results, we show that consistency in GeDTs and GeFs extends to any pattern of missing input features, if missing at random. Empirically, we show that our models often outperform common routines for treating missing data, such as K-nearest neighbour imputation, and moreover, that our models can naturally detect outliers by monitoring the marginal probability of input features.
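To make the idea concrete, the sketch below illustrates (in simplified form, not the authors' implementation) how a tree can be read generatively: each leaf carries a prior weight, a fully factorised density over the features, and a class distribution. Because leaf densities factorise, marginalising a missing feature amounts to dropping its factor, and the marginal probability of the observed features doubles as an outlier score. All class and function names here are hypothetical.

```python
import math

def gauss_pdf(x, mean, var):
    # univariate Gaussian density
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

class GenerativeLeaf:
    """A leaf viewed as a mixture component: weight * p(x | leaf) * p(y | leaf)."""
    def __init__(self, weight, means, variances, class_probs):
        self.weight = weight            # fraction of training data in this leaf
        self.means = means              # per-feature Gaussian means
        self.variances = variances      # per-feature Gaussian variances
        self.class_probs = class_probs  # empirical class distribution at the leaf

    def likelihood(self, x):
        # x maps feature index -> value; absent features are marginalised
        # out, which for a factorised density means dropping their factors.
        p = 1.0
        for i, v in x.items():
            p *= gauss_pdf(v, self.means[i], self.variances[i])
        return p

def predict_and_score(leaves, x):
    # p(x) = sum_l w_l * p(x | l); the class posterior mixes leaf class
    # distributions weighted by each leaf's responsibility for x.
    joint = [leaf.weight * leaf.likelihood(x) for leaf in leaves]
    p_x = sum(joint)
    n_classes = len(leaves[0].class_probs)
    posterior = [
        sum(j * leaf.class_probs[c] for j, leaf in zip(joint, leaves)) / p_x
        for c in range(n_classes)
    ]
    return posterior, p_x  # a low p_x flags x as a potential outlier

# toy model: two leaves over two features
leaves = [
    GenerativeLeaf(0.5, [0.0, 0.0], [1.0, 1.0], [0.9, 0.1]),
    GenerativeLeaf(0.5, [3.0, 3.0], [1.0, 1.0], [0.2, 0.8]),
]
post_full, px_full = predict_and_score(leaves, {0: 0.1, 1: -0.2})
post_miss, px_miss = predict_and_score(leaves, {0: 0.1})  # feature 1 missing
```

Note that the same routine answers both queries: with all features observed it behaves like an ordinary (soft) tree classifier, and with features missing it still returns a proper class posterior without any imputation step.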