Ricardo Bigolin Lanfredi, Pritam Mukherjee, Ronald Summers
{"title":"用保护隐私的大型语言模型和多类型注释增强胸部 X 光数据集:改进分类的数据驱动方法。","authors":"Ricardo Bigolin Lanfredi, Pritam Mukherjee, Ronald Summers","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>In chest X-ray (CXR) image analysis, rule-based systems are usually employed to extract labels from reports for dataset releases. However, there is still room for improvement in label quality. These labelers typically output only presence labels, sometimes with binary uncertainty indicators, which limits their usefulness. Supervised deep learning models have also been developed for report labeling but lack adaptability, similar to rule-based systems. In this work, we present MAPLEZ (Medical report Annotations with Privacy-preserving Large language model using Expeditious Zero shot answers), a novel approach leveraging a locally executable Large Language Model (LLM) to extract and enhance findings labels on CXR reports. MAPLEZ extracts not only binary labels indicating the presence or absence of a finding but also the location, severity, and radiologists' uncertainty about the finding. Over eight abnormalities from five test sets, we show that our method can extract these annotations with an increase of 3.6 percentage points (pp) in macro F1 score for categorical presence annotations and more than 20 pp increase in F1 score for the location annotations over competing labelers. Additionally, using the combination of improved annotations and multi-type annotations in classification supervision, we demonstrate substantial advancements in model quality, with an increase of 1.1 pp in AUROC over models trained with annotations from the best alternative approach. We share code and annotations.</p>","PeriodicalId":93888,"journal":{"name":"ArXiv","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11071620/pdf/","citationCount":"0","resultStr":"{\"title\":\"Enhancing chest X-ray datasets with privacy-preserving large language models and multi-type annotations: a data-driven approach for improved classification.\",\"authors\":\"Ricardo Bigolin Lanfredi, Pritam Mukherjee, Ronald Summers\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>In chest X-ray (CXR) image analysis, rule-based systems are usually employed to extract labels from reports for dataset releases. However, there is still room for improvement in label quality. These labelers typically output only presence labels, sometimes with binary uncertainty indicators, which limits their usefulness. Supervised deep learning models have also been developed for report labeling but lack adaptability, similar to rule-based systems. In this work, we present MAPLEZ (Medical report Annotations with Privacy-preserving Large language model using Expeditious Zero shot answers), a novel approach leveraging a locally executable Large Language Model (LLM) to extract and enhance findings labels on CXR reports. MAPLEZ extracts not only binary labels indicating the presence or absence of a finding but also the location, severity, and radiologists' uncertainty about the finding. Over eight abnormalities from five test sets, we show that our method can extract these annotations with an increase of 3.6 percentage points (pp) in macro F1 score for categorical presence annotations and more than 20 pp increase in F1 score for the location annotations over competing labelers. Additionally, using the combination of improved annotations and multi-type annotations in classification supervision, we demonstrate substantial advancements in model quality, with an increase of 1.1 pp in AUROC over models trained with annotations from the best alternative approach. We share code and annotations.</p>\",\"PeriodicalId\":93888,\"journal\":{\"name\":\"ArXiv\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11071620/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ArXiv\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
在胸部 X 光(CXR)图像分析中,通常采用基于规则的系统从报告中提取标签,但标签质量令人担忧。这些数据集通常只提供存在标签,有时还带有二进制不确定性指标,这限制了它们的实用性。在这项工作中,我们提出了 MAPLEZ(使用快速零枪答案的隐私保护大语言模型医学报告注释),这是一种利用本地可执行大语言模型(LLM)来提取和增强 CXR 报告中的发现标签的新方法。MAPLEZ 不仅能提取表示有无发现的二进制标签,还能提取发现的位置、严重程度和放射医师对发现的不确定性。在五个测试集中的八个异常情况中,我们证明了我们的方法可以提取这些注释,分类存在注释的 F1 分数提高了 5 个百分点 (pp),位置注释的 F1 分数比竞争标签器提高了 30 多个百分点。此外,在分类监督中使用这些改进的注释,我们证明了模型质量的大幅提升,与使用最先进方法注释训练的模型相比,AUROC 提高了 1.7 个百分点。我们共享代码和注释。
Enhancing chest X-ray datasets with privacy-preserving large language models and multi-type annotations: a data-driven approach for improved classification.
In chest X-ray (CXR) image analysis, rule-based systems are usually employed to extract labels from reports for dataset releases. However, there is still room for improvement in label quality. These labelers typically output only presence labels, sometimes with binary uncertainty indicators, which limits their usefulness. Supervised deep learning models have also been developed for report labeling but lack adaptability, similar to rule-based systems. In this work, we present MAPLEZ (Medical report Annotations with Privacy-preserving Large language model using Expeditious Zero shot answers), a novel approach leveraging a locally executable Large Language Model (LLM) to extract and enhance findings labels on CXR reports. MAPLEZ extracts not only binary labels indicating the presence or absence of a finding but also the location, severity, and radiologists' uncertainty about the finding. Over eight abnormalities from five test sets, we show that our method can extract these annotations with an increase of 3.6 percentage points (pp) in macro F1 score for categorical presence annotations and more than 20 pp increase in F1 score for the location annotations over competing labelers. Additionally, using the combination of improved annotations and multi-type annotations in classification supervision, we demonstrate substantial advancements in model quality, with an increase of 1.1 pp in AUROC over models trained with annotations from the best alternative approach. We share code and annotations.