Efficient population record linkage with temporal and spatial constraints.

International Journal of Population Data Science (IF 1.6, JCR Q3, Health Care Sciences & Services) Pub Date: 2022-08-25 DOI: 10.23889/ijpds.v7i3.1854
C. Nanayakkara, P. Christen
Citations: 0

Abstract

Objectives: Population databases containing birth, death, and marriage certificates or census records are increasingly used for studies in a variety of research domains. Their large scale and complexity make linking such databases highly challenging. We present a scalable blocking and linking technique that exploits temporal and spatial constraints in personal data.

Approach: Based on a state-of-the-art blocking method using locality sensitive hashing (LSH), we incorporate (a) attribute similarities, (b) temporal constraints (for example, a mother cannot give birth to two babies less than nine months apart, except in a multiple birth), and (c) spatial constraints (two births by the same mother are more likely to happen in the same location than far apart). In an iterative fashion, we identify highly confident matches first, and use these matches to further refine our constraints. We adopt a block size and frequency-based filtering approach to further enhance the efficiency of the record linkage comparison step.

Results: We conducted experiments on a Scottish data set containing 17,613 birth certificates from 1861 to 1901, where the application of standard LSH blocking generated approximately 15 million candidate record pairs, with a recall of 0.999 and a precision of 0.003. With the application of our block size and frequency-based filtering approach we obtained a ten-fold and a hundred-fold reduction of this candidate record pair set, with a small reduction of recall to 0.984 and 0.962, respectively. The comparison of record pairs in the hundred-fold reduction using our iterative linking technique achieved up to 0.961 precision and 0.811 recall. This means that our method can achieve a reduction in computational effort, and an improvement in precision of over 99%, at the cost of a decline in recall below 19%.

Conclusion: We presented a method to reduce the computational complexity of linking large and complex population databases while ensuring high linkage quality. Our method can be generalised to population databases where temporal and spatial constraints can be defined. We plan to apply our method on a Scottish database with 24 million records.
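The temporal and spatial constraints described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The 270-day gap, the one-day window for a multiple birth, the 0.5 down-weight for differing places, and the names `temporally_plausible` and `constrained_score` are all assumptions made for illustration; the abstract only states the constraints in words.

```python
from datetime import date

# All thresholds below are illustrative assumptions, not values from the paper.
MIN_GAP_DAYS = 270            # roughly nine months between separate births
MULTIPLE_BIRTH_DAYS = 1       # births this close count as a multiple birth
DIFFERENT_PLACE_WEIGHT = 0.5  # soft penalty when birth places differ


def temporally_plausible(birth_a: date, birth_b: date) -> bool:
    """Return True if the same mother could plausibly have had both births."""
    gap = abs((birth_a - birth_b).days)
    return gap <= MULTIPLE_BIRTH_DAYS or gap >= MIN_GAP_DAYS


def constrained_score(attr_sim: float,
                      birth_a: date, place_a: str,
                      birth_b: date, place_b: str) -> float:
    """Combine an attribute similarity with both constraints.

    The temporal constraint is hard (an implausible pair scores 0.0);
    the spatial constraint is soft (a different place lowers the score).
    """
    if not temporally_plausible(birth_a, birth_b):
        return 0.0
    weight = 1.0 if place_a == place_b else DIFFERENT_PLACE_WEIGHT
    return attr_sim * weight


# Hypothetical records: two births two years apart in the same place.
print(constrained_score(0.9, date(1870, 3, 1), "Parish A",
                        date(1872, 6, 15), "Parish A"))
# Five months apart: ruled out by the temporal constraint.
print(constrained_score(0.9, date(1870, 3, 1), "Parish A",
                        date(1870, 8, 1), "Parish A"))
```

Treating the temporal rule as a hard filter and the spatial rule as a weight mirrors the wording of the abstract ("cannot" versus "more likely"), but how the two are actually combined in the paper is not stated here.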
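The figures reported in the Results can be checked with a short calculation. The sketch below uses only the numbers quoted in the abstract. How the "over 99%" precision improvement was computed is not stated, so the relative measure used here (gain divided by final precision) is an assumption and only one possible reading.

```python
# Figures quoted in the abstract.
baseline_pairs = 15_000_000          # approximate candidate pairs after LSH blocking
baseline_recall, baseline_precision = 0.999, 0.003
final_recall, final_precision = 0.811, 0.961

# Candidate set sizes after the ten-fold and hundred-fold reductions.
pairs_tenfold = baseline_pairs // 10
pairs_hundredfold = baseline_pairs // 100

# Absolute decline in recall from blocking baseline to final linkage.
recall_drop = baseline_recall - final_recall

# Assumed reading of the precision claim: gain as a share of final precision.
precision_gain = final_precision - baseline_precision
relative_gain = precision_gain / final_precision

print(pairs_tenfold, pairs_hundredfold)   # 1500000 150000
print(f"{recall_drop:.3f}")               # 0.188
print(f"{relative_gain:.3f}")             # 0.997
```

Under this reading the recall decline of 0.188 is consistent with "below 19%", and the relative precision gain of about 0.997 is consistent with "over 99%".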
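Source journal: International Journal of Population Data Science
CiteScore: 2.50
Self-citation rate: 0.00%
Articles published: 386
Review time: 20 weeks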