Extraction and classification of structured data from unstructured hepatobiliary pathology reports using large language models: a feasibility study compared with rules-based natural language processing.

Journal of Clinical Pathology (IF 2.5; Zone 4, Medicine; JCR Q2, Pathology) Pub Date: 2025-01-17 DOI: 10.1136/jcp-2024-209669

Ruben Geevarghese, Carlie Sigel, John Cadley, Subrata Chatterjee, Pulkit Jain, Alex Hollingsworth, Avijit Chatterjee, Nathaniel Swinburne, Khawaja Hasan Bilal, Brett Marinelli

Publication type: Journal Article. Pages: 135-138.

Citations: 0

Abstract

Aims: Structured reporting in pathology is not universally adopted, and extracting elements essential to research often requires expensive and time-intensive manual curation. This study examines the accuracy and feasibility of using large language models (LLMs) to extract essential pathology elements for cancer research.

Methods: This was a retrospective study of patients who underwent pathology sampling for suspected hepatocellular carcinoma and who underwent Yttrium-90 embolisation. Five pathology report elements of interest were included for evaluation. LLMs (Generative Pre-trained Transformer (GPT) 3.5 turbo and GPT-4) were used to extract the elements of interest. For comparison, a rules-based regular expressions (REGEX) approach was devised for extraction. Accuracy was calculated for each approach.
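The rules-based comparison and the accuracy calculation described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the five report elements or give any patterns, so the element names (`diagnosis`, `grade`), the regular expressions, and the function names `extract_elements` and `accuracy` below are all hypothetical. Accuracy is assumed here to mean the fraction of extracted values that exactly match a manually curated reference.

```python
import re

# Hypothetical elements and patterns, for illustration only.
# The abstract does not list the five elements that were evaluated.
PATTERNS = {
    "diagnosis": re.compile(r"hepatocellular carcinoma", re.IGNORECASE),
    "grade": re.compile(
        r"\b(well|moderately|poorly)[\s-]+differentiated\b", re.IGNORECASE
    ),
}


def extract_elements(report: str) -> dict:
    """Apply each pattern to the free-text report; None means 'not found'."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report)
        if match is None:
            result[name] = None
        elif pattern.groups:
            # Pattern has a capture group: keep only the captured value.
            result[name] = match.group(1).lower()
        else:
            # No capture group: keep the whole matched phrase.
            result[name] = match.group(0).lower()
    return result


def accuracy(predictions: list, references: list) -> float:
    """Fraction of predictions that exactly match the curated reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one reference value is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up report text, not taken from the study data.
report = "Liver, core biopsy: Hepatocellular carcinoma, moderately differentiated."
extracted = extract_elements(report)
```

The same `accuracy` function could score LLM output, provided the model's answers are normalised to the same value vocabulary as the curated reference before comparison.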

Results: 88 pathology reports were identified. LLMs and REGEX were both able to extract research elements with high accuracy (average 84.1%-94.8%).

Conclusions: LLMs have significant potential to simplify the extraction of research elements from pathology reports and, therefore, to accelerate the pace of cancer research.

Source journal

Journal of Clinical Pathology (Medicine: Pathology)

CiteScore: 7.80

Self-citation rate: 2.90%

Articles published: 113

Review time: 3-8 weeks

About the journal: Journal of Clinical Pathology is a leading international journal covering all aspects of pathology. Diagnostic and research areas covered include histopathology, virology, haematology, microbiology, cytopathology, chemical pathology, molecular pathology, forensic pathology, dermatopathology, neuropathology and immunopathology. Each issue contains Reviews, Original articles, Short reports, Correspondence and more.