xMEN: a modular toolkit for cross-lingual medical entity normalization.

JAMIA Open (IF 2.5; JCR Q2, Health Care Sciences & Services). Pub Date: 2024-12-26. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae147
Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, Matthieu-P Schapranow
JAMIA Open, vol. 8, no. 1, ooae147. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671143/pdf/

Abstract

Objective: To improve the performance of medical entity normalization across many languages, especially those with fewer language resources than English.

Materials and methods: We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language.
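
The abstract does not state the form of the rank regularization term, so the following is only a minimal sketch of one way such a term could look, not the authors' formulation. It assumes the candidates are listed in candidate-generator order (index 0 is the generator's top candidate) and adds a weighted cross-entropy toward that top candidate to the usual cross-entropy toward the gold concept. The function names, the `lam` weight, and the toy scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_regularized_loss(ce_scores, gold_index, lam=0.5):
    """Cross-entropy toward the gold candidate plus a weighted
    cross-entropy toward the generator's top candidate (index 0).

    ce_scores:  cross-encoder scores, listed in candidate-generator order
    gold_index: position of the correct concept in that list
    lam:        hypothetical weight of the regularization term (assumption)
    """
    probs = softmax(ce_scores)
    gold_term = -math.log(probs[gold_index])  # standard re-ranking loss
    rank_term = -math.log(probs[0])           # pull toward the generator's ranking
    return gold_term + lam * rank_term


# Toy example: three candidates in generator order; the gold concept is at position 1.
scores = [0.2, 1.5, -0.3]
plain = rank_regularized_loss(scores, gold_index=1, lam=0.0)
regularized = rank_regularized_loss(scores, gold_index=1, lam=0.5)
```

With `lam=0.0` the loss reduces to plain cross-entropy over the candidate list; larger values keep the re-ranker closer to the generator's original ordering.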

Results: xMEN improves on state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task.

Discussion: We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future.

Conclusion: xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.
