Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory
Pub Date: 2018-01-01
DOI: 10.4230/LIPIcs.ICDT.2018.20

Yufei Tao


Citations: 9

Abstract

In entity matching classification, we are given two sets R and S of objects, where for each pair (r, s) in R × S it is known whether r and s form a match. If R and S are subsets of domains D(R) and D(S), respectively, the goal is to discover a classifier function f: D(R) × D(S) -> {0, 1} from a certain class such that, for every (r, s) in R × S, f(r, s) = 1 if and only if r and s are a match. Past research typically runs a learning algorithm directly on all the labeled (i.e., match or not) pairs in R × S. This, however, has the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing this quadratic barrier. Denote by T the set of matching pairs in R × S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R| + |S| + |T|, thereby achieving a large performance gain in the (typical) scenario where |T| << |R||S|. This paper provides evidence for the feasibility of the new direction by showing how to achieve this goal for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We do so in the MPC (massively parallel computation) model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.
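To make the problem statement concrete, the following minimal Python sketch checks whether a given linear classifier is consistent with an input (R, S, T). It is not the paper's algorithm: it enumerates all of R × S, which is exactly the quadratic cost the abstract aims to avoid. The function names `classify` and `separates`, the threshold form `score >= b`, and the toy data are all assumptions made for illustration.

```python
from itertools import product

def classify(w_r, w_s, b, r, s):
    """Return 1 if the concatenated point (r, s) lies on or above the
    plane w_r . r + w_s . s = b, and 0 otherwise (assumed convention)."""
    score = sum(a * x for a, x in zip(w_r, r)) + sum(a * y for a, y in zip(w_s, s))
    return 1 if score >= b else 0

def separates(w_r, w_s, b, R, S, T):
    """Brute-force consistency check over every pair in R x S.
    Quadratic in the input; for illustration of the definition only."""
    return all(
        classify(w_r, w_s, b, r, s) == (1 if (r, s) in T else 0)
        for r, s in product(R, S)
    )

# Hypothetical toy input: one-dimensional objects, a single matching pair.
R = [(1.0,), (5.0,)]
S = [(1.0,), (4.0,)]
T = {((5.0,), (4.0,))}

consistent = separates((1.0,), (1.0,), 8.0, R, S, T)    # only 5 + 4 reaches 8
inconsistent = separates((1.0,), (1.0,), 6.0, R, S, T)  # 5 + 1 also reaches 6
```

Here the input is described by R, S, and T alone; every pair of R × S that is absent from T is implicitly a non-match, which is what allows the input size to be |R| + |S| + |T| rather than |R||S|.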




Latest papers from this venue

Generalizing Greenwald-Khanna Streaming Quantile Summaries for Weighted Inputs
A Simple Algorithm for Consistent Query Answering under Primary Keys
Size Bounds and Algorithms for Conjunctive Regular Path Queries
Compact Data Structures Meet Databases (Invited Talk)
Enumerating Subgraphs of Constant Sizes in External Memory