Connected Components for Scaling Partial-Order Blocking to Billion Entities

Tobias Backes, Stefan Dietze
Journal of Data and Information Quality, published 2024-02-20
DOI: 10.1145/3646553 (https://doi.org/10.1145/3646553)

Abstract

In entity resolution, blocking pre-partitions data for further processing by more expensive methods. Two entity mentions are in the same block if they share identical or related blocking keys. Previous work has sometimes related blocking keys by grouping or alphabetically sorting them but, as was shown for author disambiguation, the respective equivalences or total orders are not necessarily well suited to modeling the logical matching relation between blocking keys. To address this, we present a novel blocking approach that exploits the subset partial order over entity representations to build a matching-based bipartite graph, using connected components as blocks. To prevent over- and underconnectedness, we allow specification of overly general and generalization of overly specific representations. To build the bipartite graph, we contribute a new parallelized algorithm with a configurable time/space tradeoff for minimal element search in the subset partial order. As a job-based approach, it combines dynamic scalability and easier integration to make it more convenient than the previously described approaches. Experiments on large gold standards for publication records, author mentions, and affiliation strings suggest that our approach is competitive in performance and allows better addressing of domain-specific problems. For duplicate detection and author disambiguation, our method offers the expected performance as defined by the vector-similarity baseline used in another work on the same dataset and the common surname, first-initial baseline. For top-level institution resolution, we have reproduced the challenges described in prior work, strengthening the conclusion that for affiliation data, overlapping blocks under minimal elements are more suitable than connected components.
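The idea of linking representations through the subset partial order and taking connected components as blocks can be illustrated with a small sketch. This is not the authors' parallelized algorithm. It is a brute-force, quadratic-time illustration under stated assumptions: each mention is represented as a `frozenset` of features (a hypothetical encoding), "minimal element" is read literally as a representation with no proper subset in the collection, and each mention is linked to every minimal element contained in its representation. The function names `minimal_elements` and `connected_component_blocks` are invented for this example.

```python
from collections import defaultdict


def minimal_elements(reps):
    """Return the sets in `reps` that have no proper subset in `reps`.

    Brute-force search, quadratic in the number of distinct representations.
    """
    unique = set(reps)
    return {r for r in unique if not any(other < r for other in unique)}


def connected_component_blocks(mentions):
    """Group mention ids into blocks via connected components.

    `mentions` maps a mention id to a frozenset of features (an assumed
    representation). Each mention is linked to every minimal element that
    is a subset of its representation; the connected components of the
    resulting bipartite graph are returned as sorted lists of mention ids.
    """
    minima = minimal_elements(mentions.values())

    # Union-find over two node types: mentions and minimal elements.
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for mention_id, rep in mentions.items():
        for minimum in minima:
            if minimum <= rep:  # the minimal element generalizes this mention
                union(("mention", mention_id), ("minimum", minimum))

    blocks = defaultdict(set)
    for mention_id in mentions:
        blocks[find(("mention", mention_id))].add(mention_id)
    return sorted(sorted(block) for block in blocks.values())


# Hypothetical author-mention features: surname, first initial, first name.
example_mentions = {
    "m1": frozenset({"smith", "j"}),
    "m2": frozenset({"smith", "j", "john"}),
    "m3": frozenset({"smith", "j", "james"}),
    "m4": frozenset({"miller", "a"}),
}
print(connected_component_blocks(example_mentions))
# prints [['m1', 'm2', 'm3'], ['m4']]
```

In this toy input, the general representation `{"smith", "j"}` is the minimal element shared by all three Smith mentions, so they fall into one block, while the Miller mention forms its own. A very general minimal element can merge many mentions into one component, which is one way to picture the overconnectedness the abstract mentions.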