Comprehensive testing of large language models for extraction of structured data in pathology.

Communications Medicine (IF 5.4; JCR Q1, Medicine, Research & Experimental). Pub Date: 2025-03-31. DOI: 10.1038/s43856-025-00808-8
Bastian Grothey, Jan Odenkirchen, Adnan Brkic, Birgid Schömig-Markiefka, Alexander Quaas, Reinhard Büttner, Yuri Tolkach
Volume 5, issue 1, article 96 (journal article). PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11958830/pdf/

Abstract

Background: Pathology departments generate large volumes of unstructured data as free-text diagnostic reports. Converting these reports into structured formats for analytics or artificial intelligence projects requires substantial manual effort by specialized personnel. While recent studies show promise in using advanced language models for structuring pathology data, they primarily rely on proprietary models, raising cost and privacy concerns. Additionally, important aspects such as prompt engineering and model quantization for deployment on consumer-grade hardware remain unaddressed.

Methods: We created a dataset of 579 annotated pathology reports in German and English versions. Six language models (proprietary: GPT-4; open-source: Llama2 13B, Llama2 70B, Llama3 8B, Llama3 70B, and Qwen2.5 7B) were evaluated for their ability to extract eleven key parameters from these reports. Additionally, we investigated model performance across different prompt engineering strategies and model quantization techniques to assess practical deployment scenarios.

Results: Here we show that open-source language models extract structured data from pathology reports with high precision, matching the accuracy of the proprietary GPT-4 model. Precision varies significantly across models and configurations, depending on the specific prompt engineering strategy and quantization method used during model deployment.

Conclusions: Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments. Our publicly available bilingual dataset serves as both a benchmark and a resource for future research.
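
To make the evaluation described in the Methods concrete, the following is a minimal sketch of how extracted parameters could be scored against annotations on a per-parameter basis. It is not the authors' pipeline: the function names, the parameter names (`tumor_type`, `grade`), the toy data, and the use of normalized exact-match accuracy as the metric are all assumptions made for illustration, since the abstract does not list the eleven parameters or define the scoring rule.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and trim a value so trivial formatting differences are not counted as errors."""
    return str(value).strip().lower() if value is not None else ""


def per_parameter_accuracy(predictions, annotations):
    """Return {parameter: fraction of reports whose extracted value matches the annotation}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold in zip(predictions, annotations):
        for param, gold_value in gold.items():
            total[param] += 1
            # A parameter missing from the model output counts as a mismatch.
            if normalize(pred.get(param)) == normalize(gold_value):
                correct[param] += 1
    return {param: correct[param] / total[param] for param in total}


# Hypothetical toy data: two annotated reports and the values a model extracted.
annotations = [
    {"tumor_type": "Adenocarcinoma", "grade": "G2"},
    {"tumor_type": "Squamous cell carcinoma", "grade": "G3"},
]
predictions = [
    {"tumor_type": "adenocarcinoma ", "grade": "G2"},
    {"tumor_type": "Squamous cell carcinoma", "grade": "G2"},
]

scores = per_parameter_accuracy(predictions, annotations)
for param, score in scores.items():
    print(f"{param}: {score:.2f}")
```

Running the same scoring function over the outputs of each model, prompt strategy, and quantization setting would give a table of per-parameter scores that can be compared across configurations, which is the kind of comparison the abstract reports.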
