Linking Entities across Relations and Graphs

IF 2.2 | Zone 2, Computer Science | JCR Q3: COMPUTER SCIENCE, INFORMATION SYSTEMS | ACM Transactions on Database Systems | Pub Date: 2024-01-03 | DOI: 10.1145/3639363
Wenfei Fan, Ping Lu, Kehan Pang, Ruochun Jin
Citations: 0

Abstract

This paper proposes a notion of parametric simulation to link entities across a relational database \(\mathcal{D}\) and a graph G. Taking functions and thresholds for measuring vertex closeness, path associations and important properties as parameters, parametric simulation identifies tuples t in \(\mathcal{D}\) and vertices v in G that refer to the same real-world entity, based on both topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. We show that parametric simulation is in quadratic time by providing such an algorithm. Moreover, we develop an incremental algorithm for parametric simulation; we show that the incremental algorithm is bounded relative to its batch counterpart, i.e., it incurs the minimum cost for incrementalizing the batch algorithm. Putting these together, we develop HER, a parallel system to check whether (t, v) is a match, find all vertex matches of t in G, and compute all matches across \(\mathcal{D}\) and G, all in quadratic time; moreover, HER supports incremental computation of these in response to updates to \(\mathcal{D}\) and G. Using real-life and synthetic data, we empirically verify that HER is accurate, with an F-measure of 0.94 on average, and is able to scale with database \(\mathcal{D}\) and graph G for both batch and incremental computations.
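The abstract does not give the formal definition of parametric simulation, so the following is only a toy illustration of the general idea it names: combining semantic matching (a closeness function with a threshold) with topological matching (a simulation-style check on neighbors). It is classical graph simulation gated by a similarity threshold, computed with a naive fixpoint loop. It is not the paper's algorithm, it does not model path associations or learned parameters, and it makes no claim about quadratic running time. The encoding of a tuple as a small tree, the names `closeness` and `threshold_simulation`, the threshold value 0.8, and the sample data are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def closeness(a: str, b: str) -> float:
    """Hypothetical vertex-closeness function: string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def threshold_simulation(q_labels, q_edges, g_labels, g_edges, sigma,
                         close=closeness):
    """Return the largest relation S over (tuple node, graph vertex) pairs
    such that each pair passes the closeness threshold sigma, and every
    child of the tuple node is matched by some child of the graph vertex."""
    # Semantic step: keep only pairs whose labels are close enough.
    sim = {(u, v) for u in q_labels for v in g_labels
           if close(q_labels[u], g_labels[v]) >= sigma}
    # Topological step: drop pairs whose children cannot be matched,
    # repeating until nothing changes (naive fixpoint).
    changed = True
    while changed:
        changed = False
        for (u, v) in list(sim):
            for u_child in q_edges.get(u, ()):
                if not any((u_child, v_child) in sim
                           for v_child in g_edges.get(v, ())):
                    sim.discard((u, v))
                    changed = True
                    break
    return sim


# Assumed encoding: tuple t = (name="Alan Turing", city="London") becomes
# a root node "t" with one child node per remaining attribute.
q_labels = {"t": "Alan Turing", "t.city": "London"}
q_edges = {"t": ["t.city"]}

# Sample graph G: two person vertices, each linked to a city vertex.
g_labels = {"v1": "Alan M. Turing", "v2": "London",
            "v3": "Alan Turing", "v4": "Paris"}
g_edges = {"v1": ["v2"], "v3": ["v4"]}

matches = threshold_simulation(q_labels, q_edges, g_labels, g_edges, sigma=0.8)
print(sorted(matches))
```

In this sample, vertex v3 has a name identical to the tuple's but is rejected because its neighboring city does not match, while v1 is kept despite the slightly different spelling. That is the kind of decision that label similarity alone would get wrong.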

Source journal
ACM Transactions on Database Systems
Category: Engineering and Technology, Computer Science: Software Engineering
CiteScore: 5.60
Self-citation rate: 0.00%
Articles published: 15
Review time: >12 weeks
About the journal: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.