Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage.

International Journal of Population Data Science (IF 1.6; JCR Q3, Health Care Sciences & Services). Pub Date: 2025-03-11. eCollection Date: 2023-01-01. DOI: 10.23889/ijpds.v8i5.2935
Joseph Lam, Mario Cortina-Borja, Robert Aldridge, Ruth Blackburn, Katie Harron
Volume 8, issue 5, article 2935. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11897931/pdf/

Abstract

Accurate data linkage across large administrative databases is crucial for addressing complex research and policy questions, yet linkage errors stemming from inconsistent name representations can introduce biases, predominantly for names not given in English. This data note examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis reveals that standardised romanisation systems enhance the uniqueness and consistency of name representations, thereby improving linkage precision and recall compared with HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation reached only 68.8%. Incorporating tonal information further improved recall. These findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and flexible database designs to reduce linkage errors and promote data equity for under-represented groups. We advocate for the implementation of phonetic encodings in databases, alongside language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage processes.
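To make the blocking comparison concrete, the following is a minimal sketch, not the authors' method or data. It assumes three hypothetical surname records per file, hand-entered HKG-style spellings and Jyutping codes, and exact-match blocking on a single key; the names `FILE_A`, `FILE_B`, `TRUE_PAIRS`, `strip_tones` and `evaluate` are invented for illustration. It shows how a variant spelling ("Choi" versus "Tsoi") can drop a true match out of its block, and how tone digits separate names that share a toneless spelling. It does not attempt to reproduce the recall figures reported above.

```python
import re
from itertools import product

# Hypothetical toy records: (record id, HKG-style spelling, Jyutping with tone).
# Illustrative values only; they are not taken from the study dataset.
FILE_A = [
    ("a1", "Choi", "coi3"),  # 蔡
    ("a2", "Ma", "maa5"),    # 馬
    ("a3", "Ma", "maa4"),    # 麻
]
FILE_B = [
    ("b1", "Tsoi", "coi3"),  # 蔡, recorded with a variant spelling
    ("b2", "Ma", "maa5"),    # 馬
    ("b3", "Ma", "maa4"),    # 麻
]
# Assumed ground truth: which record ids refer to the same person.
TRUE_PAIRS = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}


def strip_tones(jyutping: str) -> str:
    """Remove Jyutping tone digits (1-6), e.g. 'maa5' -> 'maa'."""
    return re.sub(r"[1-6]", "", jyutping)


def evaluate(key):
    """Block on exact key equality; return (recall, precision, n_candidates)."""
    candidates = {
        (ra[0], rb[0])
        for ra, rb in product(FILE_A, FILE_B)
        if key(ra) == key(rb)
    }
    hits = len(candidates & TRUE_PAIRS)
    recall = hits / len(TRUE_PAIRS)
    precision = hits / len(candidates) if candidates else 0.0
    return recall, precision, len(candidates)


# Three blocking keys: HKG-style spelling, toneless Jyutping, tonal Jyutping.
results = {
    "hkg": evaluate(lambda r: r[1].lower()),
    "jyutping_no_tone": evaluate(lambda r: strip_tones(r[2])),
    "jyutping_tone": evaluate(lambda r: r[2]),
}

for name, (rec, prec, n) in results.items():
    print(f"{name}: recall={rec:.2f} precision={prec:.2f} candidates={n}")
```

In this toy setting the HKG-style key misses the 蔡 pair, the toneless Jyutping key recovers it but still groups 馬 and 麻 together, and the tonal key keeps only the three true pairs. Real linkage would also need to handle name order and multi-character given names, which this sketch leaves out.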
