Record fusion: A learning approach

Paper title

Record fusion: A learning approach

Authors

Alireza Heidari, George Michalopoulos, Shrinu Kushagra, Ihab F. Ilyas, Theodoros Rekatsinas

Abstract

Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value of each attribute for each entity. Given a database, we use a combination of attribute-level, record-level, and database-level signals to construct a feature vector for each cell (that is, each (row, col) pair) of that database. We use this feature vector along with the ground-truth information to learn a classifier for each attribute of the database. Our learning algorithm uses a novel stagewise additive model. At each stage, we construct a new feature vector by combining a part of the original feature vector with features computed from the predictions of the previous stage. We then learn a softmax classifier over the new feature space. This greedy stagewise approach can be viewed as a deep model in which each stage adds more complicated non-linear transformations of the original feature vector. We show that our approach fuses records with an average precision of ~98% when source information of records is available, and ~94% without source information, across a diverse array of real-world datasets. We compare our approach to a comprehensive collection of data fusion and entity consolidation methods considered in the literature. We show that our approach can achieve an average precision improvement of ~20%/~45% with/without source information, respectively.
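
The stagewise idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which part of the feature vector each stage uses, how features are derived from earlier predictions, or how the classifiers are trained. The sketch assumes that each stage takes a caller-chosen column slice of the original features, appends the previous stage's class probabilities, and fits a softmax classifier by plain batch gradient descent. The names `fit_softmax`, `fit_stagewise`, `predict_stagewise`, and `feature_slices` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, y, n_classes, lr=0.5, epochs=300):
    """Fit a softmax (multinomial logistic) classifier by batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]              # one-hot encoded labels
    for _ in range(epochs):
        P = softmax(X @ W + b)
        G = (P - Y) / n                   # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def fit_stagewise(X, y, n_classes, feature_slices):
    """Greedily train one softmax classifier per stage.

    Assumption for illustration: stage k sees the original columns selected by
    feature_slices[k] plus the class probabilities predicted by stage k-1.
    Earlier stages are frozen once trained.
    """
    stages = []
    prev = np.zeros((X.shape[0], 0))      # no previous predictions at stage 0
    for sl in feature_slices:
        Z = np.hstack([X[:, sl], prev])   # new feature vector for this stage
        W, b = fit_softmax(Z, y, n_classes)
        stages.append((sl, W, b))
        prev = softmax(Z @ W + b)         # passed on to the next stage
    return stages

def predict_stagewise(stages, X):
    """Run the stages in order and return the final stage's predicted labels."""
    prev = np.zeros((X.shape[0], 0))
    for sl, W, b in stages:
        Z = np.hstack([X[:, sl], prev])
        prev = softmax(Z @ W + b)
    return prev.argmax(axis=1)
```

In a record-fusion setting, each row of `X` would be the feature vector of one candidate cell value and `y` would encode the ground-truth choice, with one such model trained per attribute. How those features and labels are actually constructed is described in the paper, not here.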
