Linking Entities across Relations and Graphs

IF 2.2; Zone 2, Computer Science; Q3, COMPUTER SCIENCE, INFORMATION SYSTEMS; ACM Transactions on Database Systems; Pub Date: 2024-01-03; DOI: 10.1145/3639363

Wenfei Fan, Ping Lu, Kehan Pang, Ruochun Jin

Abstract

This paper proposes a notion of parametric simulation to link entities across a relational database \(\mathcal{D}\) and a graph G. Taking as parameters the functions and thresholds that measure vertex closeness, path associations and important properties, parametric simulation identifies tuples t in \(\mathcal{D}\) and vertices v in G that refer to the same real-world entity, based on both topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. We show that parametric simulation can be computed in quadratic time by providing such an algorithm. Moreover, we develop an incremental algorithm for parametric simulation; we show that the incremental algorithm is bounded relative to its batch counterpart, i.e., it incurs the minimum cost for incrementalizing the batch algorithm. Putting these together, we develop HER, a parallel system that checks whether (t, v) makes a match, finds all vertex matches of t in G, and computes all matches across \(\mathcal{D}\) and G, all in quadratic time; moreover, HER supports incremental computation of these in response to updates to \(\mathcal{D}\) and G. Using real-life and synthetic data, we empirically verify that HER is accurate, with an F-measure of 0.94 on average, and that it scales with database \(\mathcal{D}\) and graph G for both batch and incremental computations.
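To make the combination of semantic and topological matching more concrete, here is a minimal Python sketch. It is not the paper's parametric simulation and not part of HER: it is plain graph simulation seeded by a single similarity threshold, shown only to illustrate the general flavor of the idea. The names `closeness`, `threshold_simulation`, `sigma`, the use of `difflib.SequenceMatcher` as the closeness function, and the toy data are all assumptions made for illustration; the abstract states that the actual parameter functions and thresholds are learned, and that path associations and important properties are also taken into account.

```python
from difflib import SequenceMatcher


def closeness(a: str, b: str) -> float:
    """Hypothetical closeness function: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def threshold_simulation(d_labels, d_edges, g_labels, g_edges, sigma=0.8):
    """Return pairs (u, v) where u is a vertex on the relational side and v is in G.

    A pair survives if the labels are close enough (semantic check) and every
    child of u has a surviving partner among the children of v (topological check).
    """
    # Semantic step: keep pairs whose labels reach the assumed threshold sigma.
    sim = {
        (u, v)
        for u in d_labels
        for v in g_labels
        if closeness(d_labels[u], g_labels[v]) >= sigma
    }
    # Topological step: remove pairs until a fixpoint is reached.
    changed = True
    while changed:
        changed = False
        for (u, v) in list(sim):
            for u_child in d_edges.get(u, ()):
                if not any((u_child, v_child) in sim for v_child in g_edges.get(v, ())):
                    sim.discard((u, v))
                    changed = True
                    break
    return sim


# Toy data: tuple t with one attribute value, modeled as a two-vertex graph.
d_labels = {"t": "Alice Smith", "a1": "Edinburgh"}
d_edges = {"t": ["a1"]}
# Graph G: two vertices named like t, but only v1 is linked to a matching city.
g_labels = {"v1": "alice smith", "v2": "Edinburgh", "v3": "Alice Smith", "v4": "London"}
g_edges = {"v1": ["v2"], "v3": ["v4"]}

matches = threshold_simulation(d_labels, d_edges, g_labels, g_edges)
print(sorted(matches))  # [('a1', 'v2'), ('t', 'v1')]
```

In this toy run, v3 has the same name as t but is rejected because its only neighbor does not match t's attribute value. That is the sense in which a label-only comparison and a structure-aware comparison can disagree.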

Source journal

ACM Transactions on Database Systems (Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 5.60

Self-citation rate: 0.00%

Review time: >12 weeks

Journal description: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.
