A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks

medRxiv - Health Informatics | Pub Date: 2024-08-20 | DOI: 10.1101/2024.08.14.24312010

Shelly Soffer, Benjamin S Glicksberg, Patricia Kovatch, Orly Efros, Robert Freeman, Alexander Charney, Girish Nadkarni, Eyal Klang



Abstract

Text embeddings convert text into numerical representations, enabling machines to perform semantic tasks such as information retrieval. Despite their potential, text embeddings remain underexplored in healthcare, in part because few benchmarking studies use biomedical data. This study provides a flexible framework for benchmarking embedding models to identify those most effective for healthcare-related semantic tasks. We selected thirty embedding models of various parameter sizes and architectures from the Massive Text Embedding Benchmark (MTEB) resource on Hugging Face. Models were tested on real-world semantic retrieval medical tasks using (1) PubMed abstracts, (2) synthetic Electronic Health Records (EHRs) generated by the Llama-3-70b model, (3) real-world patient data from the Mount Sinai Health System, and (4) the MIMIC-IV database. Tasks were split into Short Tasks, involving brief text pairs such as triage notes and chief complaints, and Long Tasks, which required processing extended documentation such as progress notes and history & physical notes. We assessed each model by computing the Spearman correlation between its performance and the data integrity level of the text pairs, which ranged from 0% (fully mismatched pairs) to 100% (perfectly matched pairs). Additionally, we examined correlations between the average Spearman scores across tasks and two MTEB leaderboard benchmarks: the overall recorded average and the average Semantic Textual Similarity (STS) score. We evaluated 30 embedding models across seven clinical tasks (each involving 2,000 text pairs) and five levels of data integrity, totaling 2.1 million comparisons. Some models performed consistently well, while models based on Mistral-7b excelled in long-context tasks. NV-Embed-v1, despite being the top performer in short tasks, did not perform as well in long tasks. Our average task performance score (ATPS) correlated better with the MTEB STS score (0.73) than with the MTEB average score (0.67). The suggested framework is flexible, scalable, and resistant to the risk of models overfitting on published benchmarks. Adopting this method can improve embedding technologies in healthcare.
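The integrity-level evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the evenly spaced levels (0%, 25%, 50%, 75%, 100%), the use of mean cosine similarity as the per-level performance measure, and every name below (`toy_embed`, `build_pairs`, `integrity_score`) are assumptions made for illustration. The bag-of-words `toy_embed` is a hypothetical stand-in for a real embedding model.

```python
import math
import random
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_pairs(pairs, integrity, rng):
    """Keep a fraction `integrity` of pairs matched; re-pair the rest with another right side."""
    n = len(pairs)
    matched = set(rng.sample(range(n), round(integrity * n)))
    out = []
    for i, (left, right) in enumerate(pairs):
        if i not in matched:
            j = rng.choice([k for k in range(n) if k != i])
            right = pairs[j][1]
        out.append((left, right))
    return out

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0

def integrity_score(pairs, embed, levels=(0.0, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Correlate mean pair similarity with the integrity level (levels are assumed)."""
    rng = random.Random(seed)
    mean_sims = []
    for level in levels:
        sample = build_pairs(pairs, level, rng)
        sims = [cosine(embed(left), embed(right)) for left, right in sample]
        mean_sims.append(sum(sims) / len(sims))
    return spearman(list(levels), mean_sims), mean_sims

# Toy data (invented): each left text matches only its own right text.
pairs = [
    ("chest pain", "chest pain today"),
    ("ankle sprain", "ankle sprain today"),
    ("severe headache", "severe headache today"),
    ("persistent cough", "persistent cough today"),
]
rho, means = integrity_score(pairs, toy_embed)
print(rho, means)

# Comparison count stated in the abstract: models x tasks x pairs x levels.
print(30 * 7 * 2000 * 5)  # 2100000
```

Under these assumptions, a model that separates matched from mismatched pairs yields mean similarities that rise with the integrity level, so its Spearman score approaches 1; a model that cannot tell them apart yields a score near 0. Swapping `toy_embed` for a real model's encoding function is the only change the sketch would need to score that model on a given task.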


