Unpacking unstructured data: A pilot study on extracting insights from neuropathological reports of Parkinson's Disease patients using large language models.

Biology Methods and Protocols. Published 2024-10-04 (eCollection 2024-01-01). DOI: 10.1093/biomethods/bpae072. Impact factor 1.3; JCR Q3 (Biochemical Research Methods).

Oleg Stroganov, Amber Schedlbauer, Emily Lorenzen, Alex Kadhim, Anna Lobanova, David A Lewis, Jill R Glausier

Volume 9(1), article bpae072. Publication type: Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11513015/pdf/


Abstract

The aim of this study was to make unstructured neuropathological data in the NeuroBioBank (NBB) conform to the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles, and to investigate the potential of large language models (LLMs) for wrangling unstructured neuropathological reports. By making the currently inconsistent and disparate data findable, our overarching goal was to increase research output and speed. The NBB catalog currently includes information from medical records, interview results, and neuropathological reports. These reports contain information that is crucial for in-depth analysis of NBB data, but their formats vary across NBB biorepositories and change over time. In this study, we focused on a subset of 822 donors with Parkinson's disease (PD) from seven NBB biorepositories. We developed a data model with combined Brain Region and Pathological Findings data at its core. This approach made it easier to build an extraction pipeline and was flexible enough to convert the resulting data to Common Data Elements, a standardized data collection tool used by the neuroscience community to improve consistency and facilitate data sharing across studies. This pilot study demonstrated the potential of LLMs for structuring the unstructured neuropathological reports of PD patients available in the NBB. The pipeline successfully extracted detailed tissue-level (microscopic) and gross anatomical (macroscopic) observations, along with staging information, from pathology reports, with extraction quality comparable to manual curation. To our knowledge, this is the first attempt to automatically standardize neuropathological information at this scale. The collected data have the potential to serve as a valuable resource for PD researchers, facilitating integration with clinical information and genetic data (such as genome-wide genotyping and whole-genome sequencing) available through the NBB, thereby enabling a more comprehensive understanding of the disease.
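To make the idea of a data model built around combined Brain Region and Pathological Findings data more concrete, here is a minimal Python sketch. It is an illustration only, not the authors' schema: the class names (`Finding`, `DonorReport`), the field names, the flattened key format, and the example values (including the "Braak" staging label) are all assumptions, and the flattened keys are not actual Common Data Element identifiers.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """One observation: a pathological finding tied to a brain region (hypothetical schema)."""
    brain_region: str               # e.g. "substantia nigra"
    pathology: str                  # e.g. "Lewy bodies"
    level: str                      # "microscopic" or "macroscopic"
    severity: Optional[str] = None  # grade as stated in the report, if any


@dataclass
class DonorReport:
    """All structured findings for one donor (hypothetical schema)."""
    donor_id: str
    findings: list[Finding] = field(default_factory=list)
    staging: dict[str, str] = field(default_factory=dict)  # staging scheme -> stage


def _slug(text: str) -> str:
    """Lower-case the text and join its words with underscores."""
    return "_".join(text.lower().split())


def to_flat_elements(report: DonorReport) -> dict[str, str]:
    """Flatten a report into key/value pairs, one key per (level, region, pathology).

    The key format is illustrative and does not correspond to real CDE names.
    """
    flat = {"donor_id": report.donor_id}
    for f in report.findings:
        key = f"{f.level}.{_slug(f.brain_region)}.{_slug(f.pathology)}"
        # A finding with no stated severity is recorded simply as "present".
        flat[key] = f.severity or "present"
    for scheme, stage in report.staging.items():
        flat[f"staging.{_slug(scheme)}"] = stage
    return flat


# Made-up example record, not taken from the study data.
example = DonorReport(
    donor_id="EXAMPLE-001",
    findings=[
        Finding("substantia nigra", "Lewy bodies", "microscopic", "moderate"),
        Finding("frontal cortex", "atrophy", "macroscopic"),
    ],
    staging={"Braak": "IV"},
)
flat_example = to_flat_elements(example)
```

Keeping each region-finding pair as its own record means an extraction step only has to emit a list of small, uniform objects, while a separate function handles the mapping to whatever flat, standardized form is required downstream.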



