Unpacking unstructured data: A pilot study on extracting insights from neuropathological reports of Parkinson's Disease patients using large language models.

Biology Methods and Protocols (IF 2.5, JCR Q3, Biochemical Research Methods). Pub Date: 2024-10-04. eCollection Date: 2024-01-01. DOI: 10.1093/biomethods/bpae072
Oleg Stroganov, Amber Schedlbauer, Emily Lorenzen, Alex Kadhim, Anna Lobanova, David A Lewis, Jill R Glausier

Abstract

The aim of this study was to make unstructured neuropathological data in the NeuroBioBank (NBB) follow the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles and to investigate the potential of large language models (LLMs) in wrangling unstructured neuropathological reports. By making the currently inconsistent and disparate data findable, our overarching goal was to enhance research output and speed. The NBB catalog currently includes information from medical records, interview results, and neuropathological reports. These reports contain crucial information necessary for conducting an in-depth analysis of NBB data, but their formats vary across NBB biorepositories and change over time. In this study, we focused on a subset of 822 donors with Parkinson's disease (PD) from seven NBB biorepositories. We developed a data model with combined Brain Region and Pathological Findings data at its core. This approach made it easier to build an extraction pipeline and was flexible enough to convert the resulting data to Common Data Elements, a standardized data collection tool used by the neuroscience community to improve consistency and facilitate data sharing across studies. This pilot study demonstrated the potential of LLMs in structuring unstructured neuropathological reports of PD patients available in the NBB. The pipeline enabled successful extraction of detailed tissue-level (microscopic) and gross anatomical (macroscopic) observations, along with staging information from pathology reports, with extraction quality comparable to manual curation results. To our knowledge, this is the first attempt to automatically standardize neuropathological information at this scale. The collected data have the potential to serve as a valuable resource for PD researchers, facilitating integration with clinical information and genetic data (such as genome-wide genotyping and whole-genome sequencing) available through the NBB, thereby enabling a more comprehensive understanding of the disease.
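
The abstract describes a data model built around paired Brain Region and Pathological Findings data that can later be converted to Common Data Elements. The sketch below illustrates one way such a pairing could be represented and flattened into per-donor records. It is only an illustration under stated assumptions: the class name, field names, key format, and example values are all hypothetical and are not taken from the paper, the NBB, or any real Common Data Element definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """One observation: a brain region paired with a pathological finding.

    All field names are hypothetical, chosen for illustration only.
    """
    donor_id: str
    brain_region: str               # e.g. "substantia nigra"
    finding: str                    # e.g. "Lewy bodies"
    scale: str                      # "microscopic" or "macroscopic"
    severity: Optional[str] = None  # free-text grade, if the report gives one


def to_flat_records(findings: list[Finding]) -> dict[str, dict[str, str]]:
    """Pivot findings into one flat record per donor.

    Each column key is "region|finding"; this key format is a placeholder,
    not a real Common Data Element name.
    """
    records: dict[str, dict[str, str]] = {}
    for f in findings:
        key = f"{f.brain_region}|{f.finding}"
        # Fall back to "present" when no severity grade was extracted.
        records.setdefault(f.donor_id, {})[key] = f.severity or "present"
    return records


# Invented example rows, not data from the study.
example = [
    Finding("donor-001", "substantia nigra", "neuronal loss", "microscopic", "severe"),
    Finding("donor-001", "substantia nigra", "Lewy bodies", "microscopic"),
    Finding("donor-001", "frontal cortex", "atrophy", "macroscopic", "mild"),
]
flat = to_flat_records(example)
```

Keeping each region and finding pair as its own row means new regions or findings need no schema change, and a flat per-donor view can be derived when a fixed-column format is required.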

Source journal
Biology Methods and Protocols (Agricultural and Biological Sciences, all)
CiteScore: 3.80
Self-citation rate: 2.80%
Articles published: 28
Review time: 19 weeks