Exploring locality sensitive hashing as a blocking method for large-scale administrative datasets

International Journal for Population Data Science. Pub Date: 2023-09-14. DOI: 10.23889/ijpds.v8i2.2318

Leah Quinn, Rachel Shipsey





Exploring locality sensitive hashing as a blocking method for large-scale administrative datasets

Objectives

Linking large-scale datasets is challenging due to the computational power required. This research explores using Locality-Sensitive Hashing (LSH) as a blocking method to reduce the computational complexity of linking large administrative datasets. LSH hashes similar data into ‘buckets’, thus reducing the search space and processing power required to find links.

Methods

A gold-standard linked dataset was used during method development. Test datasets were made using samples of gold-standard matches and non-matches, then blocked using LSH. Various LSH parameters were tested, including shingle length, signature length, band size and number of matching bands. Precision and recall were used to find optimal parameters for identifying good candidate pairs, with 100% recall and >20% precision being desirable. Alternative formats for the date of birth, postcode and gender variables were tested, with additional characters used to simulate agreement weighting.

Results

Results as of spring 2023 are promising, with the caveat that only small datasets have been tested so far. The LSH method with optimal parameters creates ~9,000 candidate pairs whilst maintaining recall of 100% (i.e., all true matches are included in the candidate pairs) and precision of 27.6%. In contrast, our traditional deterministic blocking method using the same variables creates ~70,000 candidate pairs, and a Cartesian product creates over 23.4 million candidate pairs. We have therefore shown that LSH can significantly reduce the size of the search space. Furthermore, the method easily handles alternative names, postcodes, etc. that may be present in longitudinal data or composite datasets, with no need to account for different possible combinations of variables.

Conclusion

Current research has shown that LSH can be used to drastically reduce the search space when blocking for data linkage. Using variable formatting to prioritise agreement on specific sections of a variable (e.g., part of the postcode) has overcome a potential downside of LSH. Further research on variable formatting, parameter optimisation and testing of the method at scale is ongoing.
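The blocking approach described in the abstract can be illustrated with a minimal sketch. It assumes MinHash over character shingles with banding, which is one common LSH construction; the abstract does not state which LSH family or implementation the authors used. All function names, the record format (one concatenated string per record) and the default parameter values are hypothetical and chosen only for illustration. The parameters mirror those named in the abstract: shingle length, signature length, band size and number of matching bands.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of overlapping k-character shingles of a string."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def seeded_hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, signature_length=32):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(seeded_hash(seed, s) for s in shingle_set)
            for seed in range(signature_length)]


def band_keys(text, k, signature_length, band_size):
    """Split a record's signature into bands; each band is one bucket key."""
    sig = minhash_signature(shingles(text, k), signature_length)
    return [(start, tuple(sig[start:start + band_size]))
            for start in range(0, signature_length, band_size)]


def lsh_candidate_pairs(left, right, k=2, signature_length=32,
                        band_size=4, min_matching_bands=1):
    """Return (left_id, right_id) pairs that share at least
    min_matching_bands buckets. left and right map record id -> string."""
    if signature_length % band_size != 0:
        raise ValueError("signature_length must be a multiple of band_size")
    buckets = defaultdict(list)
    for left_id, text in left.items():
        for key in band_keys(text, k, signature_length, band_size):
            buckets[key].append(left_id)
    shared_bands = defaultdict(int)
    for right_id, text in right.items():
        for key in band_keys(text, k, signature_length, band_size):
            for left_id in buckets.get(key, ()):
                shared_bands[(left_id, right_id)] += 1
    return {pair for pair, n in shared_bands.items()
            if n >= min_matching_bands}


def precision_recall(candidates, true_matches):
    """Precision and recall of candidate pairs against gold-standard matches."""
    true_positives = len(candidates & true_matches)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    return precision, recall
```

In this sketch, two records with identical strings always share every band, so they are always returned as a candidate pair; records with partial overlap are returned with a probability that depends on their shingle overlap and on the band settings. Raising `band_size` or `min_matching_bands` makes blocking stricter (fewer candidates, higher risk of missed matches), which is the precision and recall trade-off the abstract describes. For scale, the reported ~9,000 candidate pairs are roughly 0.04% of the 23.4 million pairs in the Cartesian product.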
