Exploring locality sensitive hashing as a blocking method for large-scale administrative datasets

Leah Quinn, Rachel Shipsey
International Journal for Population Data Science. Published 2023-09-14.
DOI: 10.23889/ijpds.v8i2.2318 (https://doi.org/10.23889/ijpds.v8i2.2318)

Abstract

Objectives
Linking large-scale datasets is challenging due to the computational power required. This research explores using Locality-Sensitive Hashing (LSH) as a blocking method to reduce the computational complexity when linking large administrative datasets. LSH hashes similar data into 'buckets', thus reducing the search space and processing power required to find links.

Methods
A gold-standard linked dataset was used during method development. Test datasets were made using samples of gold-standard matches and non-matches, then blocked using LSH.
Various LSH parameters, including shingle length, signature length, band size and number of matching bands, were tested. Precision and recall were used to find optimal parameters for identifying good candidate pairs, with 100% recall and >20% precision being desirable.
Alternative formats for date of birth, postcode and gender variables were tested, with additional characters used to simulate agreement weighting.

Results
Results as of spring 2023 are promising, with the caveat that currently only small datasets have been tested. The LSH method with optimal parameters creates ~9,000 candidate pairs whilst maintaining recall of 100% (i.e., all true matches are included in the candidate pairs) and precision of 27.6%. In contrast, our traditional deterministic blocking method using the same variables creates ~70,000 candidate pairs, and a Cartesian product creates over 23.4 million candidate pairs. We have therefore shown that LSH can be used to create a significant reduction in the search-space size.
Furthermore, the method easily handles alternative names, postcodes, etc. that may be present in longitudinal data or composite datasets, with no need to account for different possible combinations of variables.

Conclusion
Current research has shown that LSH can be used to drastically reduce the search space when blocking for data linkage. Using variable formatting to prioritise agreement for specific sections, e.g., of postcode, has overcome a potential downside of LSH. Further research on variable formatting, parameter optimisation and testing of the method at scale is ongoing.
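
The abstract does not describe the implementation, so the following is only a minimal sketch of one common way to realise LSH blocking (MinHash signatures split into bands), not the authors' code. The function names, default parameter values and toy records are all assumptions made for illustration. The parameters mirror the ones listed in the abstract (shingle length `k`, `signature_length`, `band_size`, `min_matching_bands`).

```python
import hashlib
from itertools import combinations

def shingles(text, k=2):
    """Return the set of overlapping character k-grams (shingles) of text."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(shingle_set, signature_length=32):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return tuple(min(_hash(seed, s) for s in shingle_set)
                 for seed in range(signature_length))

def candidate_pairs(records, k=2, signature_length=32, band_size=4,
                    min_matching_bands=1):
    """Return pairs of record ids that collide in enough signature bands."""
    buckets = {}  # (band start, band values) -> record ids hashed to that bucket
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text, k), signature_length)
        for b in range(0, signature_length, band_size):
            buckets.setdefault((b, sig[b:b + band_size]), []).append(rec_id)
    band_hits = {}  # pair -> number of bands in which both records collide
    for ids in buckets.values():
        for pair in combinations(sorted(ids), 2):
            band_hits[pair] = band_hits.get(pair, 0) + 1
    return {pair for pair, hits in band_hits.items() if hits >= min_matching_bands}

def precision_recall(candidates, true_matches):
    """Precision and recall of a candidate-pair set against known matches."""
    true_positives = len(candidates & true_matches)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    return precision, recall

# Invented toy records: name, date of birth, postcode and gender in one string.
records = {
    "A1": "jane smith 1980-02-14 ab12 3cd f",
    "B1": "jane smyth 1980-02-14 ab12 3cd f",
    "B2": "omar khan 1975-11-30 zz9 9zz m",
}
pairs = candidate_pairs(records)
print(pairs, precision_recall(pairs, {("A1", "B1")}))
```

For simplicity the sketch blocks all records in a single pool rather than across two datasets. Raising `min_matching_bands` or `band_size` makes blocking stricter (fewer candidate pairs, higher risk of missing true matches), which is the precision and recall trade-off the abstract tunes. Taking the abstract's rounded figures at face value, ~9,000 candidate pairs is roughly 0.04% of the 23.4 million pairs in the Cartesian product and about 13% of the ~70,000 pairs from deterministic blocking.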