A new AI-assisted data standard accelerates interoperability in biomedical research.

Rodney Alan Long, Shannon Ballard, Syed Shah, Owen Bianchi, Lietsel Jones, Mathew J Koretsky, Nicole Kuznetsov, Elise Marsan, Bryant Jen, Phillip Chiang, Abhradeep Mukherjee, Cornelis Blauwendraat, Hampton Leonard, Dan Vitale, Kristin Levine, Sara Bandres-Ciga, Paige Jarreau, Patrick Brannely, Caroline Pantazis, Laurel Screven, Kate Andersh, Alifiya Kapasi, John F Crary, David Gutman, Brittany N Dugger, Sarah Biber, Tim Hohman, Faraz Faghri, Michael Griswold, Lana Sargent, Kendall van Keuren-Jensen, Andrew B Singleton, Yang Fann, Mike A Nalls, Hirotaka Iwaki
DOI: 10.1101/2024.10.17.24315618 (https://doi.org/10.1101/2024.10.17.24315618)
Journal: medRxiv : the preprint server for health sciences
Published: 2024-10-17
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11527042/pdf/

Abstract

In this paper, we leveraged large language models (LLMs) to accelerate data wrangling and automate labor-intensive aspects of data discovery and harmonization. This work promotes interoperability standards and enhances data discovery, facilitating AI-readiness in biomedical science by generating Common Data Elements (CDEs), which are key to harmonizing multiple datasets. Thirty-one studies, various ontologies, and medical coding systems served as source material for creating CDEs; for each CDE, the available metadata and context were sent as an API request to 4th-generation OpenAI GPT models to populate each metadata field. A human-in-the-loop (HITL) approach was used to assess the quality and accuracy of the generated CDEs. To regulate CDE generation, we employed ElasticSearch and HITL review to avoid creating duplicate CDEs, instead adding them as potential aliases for existing CDEs. The generated CDEs are foundational for assessing the interoperability potential of datasets: we determine how many dataset column headers can be correctly mapped to CDEs and quantify compliance with permissible values and data types. Subject matter experts reviewed the generated CDEs and determined that 94.0% of generated metadata fields did not require manual revision. Data tables from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Global Parkinson's Genetic Program (GP2) were used as test cases for interoperability assessments. Across all test cases, 32.4% of column headers were successfully mapped to generated CDEs via ElasticSearch. The interoperability score, a metric of a dataset's compatibility with CDEs and other connected datasets based on criteria such as data field completeness and compliance with common harmonization standards, averaged 53.8 out of 100 for the test cases. With this project, we aim to automate the most tedious aspects of data harmonization, enhancing efficiency and scalability in biomedical research while lowering the activation energy for federated research.
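
To make the two assessment quantities in the abstract concrete (the share of column headers that map to a CDE, and compliance with permissible values), the following is a minimal sketch and not the authors' pipeline. It assumes a tiny hypothetical CDE catalog and swaps in plain string similarity from Python's standard library for the ElasticSearch lookup. The CDE names, aliases, permissible values, and the 0.85 threshold are all invented for illustration. The abstract does not give the formula for the interoperability score, so none is attempted here.

```python
from difflib import SequenceMatcher

# Hypothetical CDE catalog: names, aliases, and permissible values are
# invented for illustration and are not taken from the paper.
CDES = {
    "age_at_baseline": {"aliases": ["age", "age_baseline"], "permissible": None},
    "sex": {"aliases": ["gender", "sex_at_birth"], "permissible": {"male", "female"}},
    "diagnosis": {"aliases": ["dx", "primary_diagnosis"], "permissible": {"control", "ad", "pd"}},
}


def normalize(name):
    """Lowercase a header and unify separators so comparisons are fair."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")


def map_header(header, cdes=CDES, threshold=0.85):
    """Return the best-matching CDE name, or None if nothing is close enough.

    String similarity stands in for the search-engine lookup described in
    the abstract; the threshold is an arbitrary assumption.
    """
    target = normalize(header)
    best_name, best_score = None, 0.0
    for cde_name, spec in cdes.items():
        # Compare against the CDE's own name and every registered alias.
        for candidate in [cde_name, *spec["aliases"]]:
            score = SequenceMatcher(None, target, normalize(candidate)).ratio()
            if score > best_score:
                best_name, best_score = cde_name, score
    return best_name if best_score >= threshold else None


def mapping_rate(headers, cdes=CDES):
    """Fraction of column headers that map to some CDE."""
    if not headers:
        return 0.0
    mapped = sum(1 for h in headers if map_header(h, cdes) is not None)
    return mapped / len(headers)


def value_compliance(values, permissible):
    """Fraction of non-missing values that fall in the permissible set."""
    present = [v for v in values if v is not None]
    if permissible is None or not present:
        return 1.0  # nothing to check against, or nothing to check
    ok = sum(1 for v in present if str(v).strip().lower() in permissible)
    return ok / len(present)
```

For example, `mapping_rate(["Sex", "DX", "visit_date", "Age Baseline"])` counts three of the four headers as mapped under these assumptions, since `visit_date` has no close match in the toy catalog. A production system would more plausibly rely on a search index with alias handling and human review, as the abstract describes.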
