Analysing Siamese Neural Network Architectures for Computing Name Similarity.

International Journal of Population Data Science (IF 1.6, Q3, Health Care Sciences & Services). Pub Date: 2022-08-25. DOI: 10.23889/ijpds.v7i3.2077
Nicholas Vinden, Jérémy Foxcroft, L. Antonie
Citations: 0

Abstract

Introduction
A wide assortment of string similarity measures can be used to determine how similar two names are. A diverse set of discriminating and independent features for name similarity is important for classification during record linkage. A Siamese neural network could surpass traditional string similarity measures for the name similarity problem.

Objectives and Approach
This research aims to compare a classifier based on the Siamese network architecture with a Random Forest classifier. In addition to comparing overall performance, we seek to answer whether certain matching name pairs have special properties for which the complexity of the Siamese network offers particular benefit.
Our data consists of 25,000 last name pairings, with each pair being two variants of a family name. Name similarity predictions from the Siamese network are compared to those of a Random Forest model that serves as an ensemble of existing string similarity measures.

Results
We compare the similarity scores yielded by the two methods and discuss the results. We describe how names are represented for each method; the name representation is computed formulaically for the traditional measures but is learned by the Siamese network during training. The methods are compared in terms of both the quality of their similarity predictions and the computational cost of generating those predictions.
As expected, the Siamese network incurs a significant computational cost to train. Unexpectedly, the ensemble of traditional measures yields almost identical overall classification performance. However, we expect that further analysis of false positives and false negatives will yield some insight into when practitioners should consider one method over the other.

Conclusions/Implications
Results suggest that there may be instances where a Siamese network outperforms other similarity measures, although training a Siamese network comes at a considerable computational cost. It is worth considering this approach to name similarity as an additional similarity feature when performing record linkage tasks.
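
As an illustration of what "computed formulaically" can mean for the traditional measures, the following is a minimal sketch of turning a name pair into a small feature vector. The abstract does not say which string similarity measures the authors used, so the three chosen here (normalised edit similarity, bigram Jaccard overlap, and the `difflib` matching ratio) are assumptions, and the helper names `levenshtein`, `bigrams`, and `name_pair_features` are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigrams(name: str) -> set:
    """Set of adjacent character pairs in a name."""
    return {name[i:i + 2] for i in range(len(name) - 1)}


def name_pair_features(a: str, b: str) -> list:
    """Illustrative feature vector for one name pair (assumed measures,
    not necessarily those used in the study). Each value lies in [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    edit_similarity = 1.0 - levenshtein(a, b) / longest
    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    jaccard = len(ga & gb) / len(union) if union else 1.0
    matching_ratio = SequenceMatcher(None, a, b).ratio()
    return [edit_similarity, jaccard, matching_ratio]


if __name__ == "__main__":
    print(name_pair_features("Smith", "Smyth"))
```

Feature vectors of this kind, one per name pair, could then be passed to an off-the-shelf random forest implementation to obtain the ensemble described above. A Siamese network would instead learn its own representation of each name during training.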