Comprehensive testing of large language models for extraction of structured data in pathology.

Communications medicine | IF 5.4 | Q1, Medicine, Research & Experimental | Pub Date: 2025-03-31 | DOI: 10.1038/s43856-025-00808-8

Bastian Grothey, Jan Odenkirchen, Adnan Brkic, Birgid Schömig-Markiefka, Alexander Quaas, Reinhard Büttner, Yuri Tolkach

Communications medicine, vol. 5, no. 1, 96 (Journal Article). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11958830/pdf/

Abstract

Background: Pathology departments generate large volumes of unstructured data as free-text diagnostic reports. Converting these reports into structured formats for analytics or artificial intelligence projects requires substantial manual effort by specialized personnel. While recent studies show promise in using advanced language models for structuring pathology data, they primarily rely on proprietary models, raising cost and privacy concerns. Additionally, important aspects such as prompt engineering and model quantization for deployment on consumer-grade hardware remain unaddressed.

Methods: We created a dataset of 579 annotated pathology reports, available in German and English versions. Six language models (proprietary: GPT-4; open-source: Llama2 13B, Llama2 70B, Llama3 8B, Llama3 70B, and Qwen2.5 7B) were evaluated for their ability to extract eleven key parameters from these reports. Additionally, we investigated model performance across different prompt engineering strategies and model quantization techniques to assess practical deployment scenarios.

Results: Here we show that open-source language models extract structured data from pathology reports with high precision, matching the accuracy of the proprietary GPT-4 model. The precision varies significantly across different models and configurations. These variations depend on the specific prompt engineering strategies and quantization methods used during model deployment.

Conclusions: Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments. Our publicly available bilingual dataset serves as both a benchmark and a resource for future research.
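
As an illustration of the kind of evaluation the Methods describe, the following is a minimal sketch of scoring extracted parameters against annotations. It is not the authors' pipeline: the abstract does not state the metric or the output format, so per-field exact-match accuracy, JSON-formatted model replies, and the field names shown here are all assumptions made for the example.

```python
import json
from collections import defaultdict


def parse_model_output(raw: str) -> dict:
    """Parse a model reply that is expected to be a JSON object.

    Returns an empty dict when the reply is not valid JSON or is not an
    object, so a malformed reply contributes no extracted values.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def normalize(value) -> str:
    """Compare values case-insensitively; treat None/missing as empty."""
    return "" if value is None else str(value).strip().lower()


def per_field_accuracy(gold_reports, raw_outputs, fields):
    """Exact-match accuracy per field (an assumed metric, see lead-in).

    gold_reports: list of dicts holding the annotated values per report.
    raw_outputs:  list of raw model replies, one per report.
    fields:       names of the parameters to score (hypothetical names).
    """
    if len(gold_reports) != len(raw_outputs):
        raise ValueError("need exactly one model output per annotated report")
    if not gold_reports:
        raise ValueError("no reports to score")

    correct = defaultdict(int)
    for gold, raw in zip(gold_reports, raw_outputs):
        predicted = parse_model_output(raw)
        for field in fields:
            if normalize(predicted.get(field)) == normalize(gold.get(field)):
                correct[field] += 1

    total = len(gold_reports)
    return {field: correct[field] / total for field in fields}


if __name__ == "__main__":
    # Hypothetical field names and toy values, not taken from the dataset.
    fields = ["tumor_type", "grade"]
    gold = [
        {"tumor_type": "Adenocarcinoma", "grade": "G2"},
        {"tumor_type": "Squamous cell carcinoma", "grade": "G3"},
    ]
    outputs = [
        '{"tumor_type": "adenocarcinoma", "grade": "G2"}',
        "Sorry, I cannot parse this report.",  # malformed reply
    ]
    print(per_field_accuracy(gold, outputs, fields))
```

Running the same scoring over each model, prompt variant, and quantization level would give a table of per-parameter scores comparable across configurations. Exact matching after lowercasing is a deliberately strict choice; a real evaluation might need field-specific normalization.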
