Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare.

AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science. Pub Date: 2024-05-31. eCollection Date: 2024-01-01.

Prosanta Barai, Gondy Leroy, Prakash Bisht, Joshua M Rothman, Sumi Lee, Jennifer Andrews, Sydney A Rice, Arif Ahmed

Volume 2024, pages 75-84. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11141838/pdf/


Abstract

Large Language Models (LLMs) have demonstrated immense potential in artificial intelligence across various domains, including healthcare. However, their efficacy is limited by the need for high-quality labeled data, which is often expensive and time-consuming to create, particularly in low-resource domains like healthcare. To address these challenges, we propose a crowdsourcing (CS) framework enriched with quality control measures at the pre-, real-time, and post-data-gathering stages. Our study evaluated the effectiveness of enhancing data quality through its impact on an LLM (Bio-BERT) for predicting autism-related symptoms. The results show that real-time quality control improves data quality by 19% compared to pre-quality control. Fine-tuning Bio-BERT on crowdsourced data generally increased recall compared to the Bio-BERT baseline but lowered precision. Our findings highlight the potential of crowdsourcing and quality control in resource-constrained environments and offer insights into optimizing healthcare LLMs for informed decision-making and improved patient care.


