Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists' Knowledge on COVID-19 Impacts in Pregnancy: A Cross-Sectional Pilot Study.

JMIR Formative Research (IF 2.0, Q3, Health Care Sciences & Services) · Pub Date: 2025-01-09 · DOI: 10.2196/56126
Nicola Bragazzi, Michèle Buchinger, Hisham Atwan, Ruba Tuma, Francesco Chirico, Lukasz Szarpak, Raymond Farah, Rola Khamisy-Farah

Abstract

Background: The COVID-19 pandemic has significantly strained healthcare systems globally, producing an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an "infodemic" of misinformation has emerged, particularly prevalent in women's health. This challenge has been pivotal for healthcare providers, especially gynecologists and obstetricians, in managing pregnant women's health. The pandemic heightened the risks COVID-19 poses to pregnant women, necessitating balanced advice from specialists weighing vaccine safety against known risks. Additionally, the advent of generative artificial intelligence (AI), such as large language models (LLMs), offers promising support in healthcare; however, these tools require rigorous testing.

Objective: To assess LLMs' proficiency, clarity, and objectivity regarding COVID-19 impacts in pregnancy.

Methods: This study evaluated four major AI prototypes (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts drawn from a questionnaire validated among 159 Israeli gynecologists and obstetricians. The questionnaire assesses proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text mining, sentiment analysis, and readability analyses (Flesch-Kincaid Grade Level and Flesch Reading Ease Score) were also conducted.
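The two readability metrics used here are standard closed-form formulas over word, sentence, and syllable counts. A minimal sketch (the exact counting conventions the authors used are not specified, so the functions below take pre-computed counts as inputs):

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values mean easier text.
    Scores in the 30s (as reported for ChatGPT-3.5/4 and Bard) indicate
    college-level difficulty; ~49 (Copilot) is closer to plain prose."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed
    to understand the text (e.g., 9.9 ~ tenth grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative counts (hypothetical, not from the paper):
# 100 words across 5 sentences with 150 syllables.
print(flesch_reading_ease(100, 5, 150))   # 59.635
print(flesch_kincaid_grade(100, 5, 150))  # 9.91
```

In practice, syllable counting is the fiddly part; published implementations differ slightly, which is why reported scores for the same text can vary by a point or two across tools.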

Results: In terms of LLMs' knowledge, ChatGPT-4 and Microsoft Copilot each scored 97% (n=32/33), Google Bard 94% (n=31/33), and ChatGPT-3.5 82% (n=27/33). ChatGPT-4 incorrectly stated an increased risk of miscarriage due to COVID-19. Google Bard and Microsoft Copilot had minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, Microsoft Copilot achieved the least negative score (-4), followed by ChatGPT-4 (-6) and Google Bard (-7), while ChatGPT-3.5 obtained the most negative score (-12). Finally, concerning the readability analysis, the Flesch-Kincaid Grade Level and Flesch Reading Ease Score showed that Microsoft Copilot was the most accessible (9.9 and 49), followed by ChatGPT-4 (12.4 and 37.1), while ChatGPT-3.5 (12.9 and 35.6) and Google Bard (12.9 and 35.8) generated particularly complex responses.
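The signed sentiment scores above (ranging here from -4 to -12) are typical of lexicon-based scoring, where each positive or negative word in a response shifts a running total. The paper does not specify its sentiment tooling, so the following is only a generic sketch of that family of methods, with tiny made-up word lists:

```python
# Hypothetical mini-lexicons for illustration; real sentiment lexicons
# (e.g., from NLTK or dedicated sentiment packages) contain thousands of terms.
NEGATIVE = {"risk", "severe", "death", "complication", "dangerous"}
POSITIVE = {"safe", "effective", "protect", "benefit", "healthy"}

def lexicon_sentiment(text: str) -> int:
    """Return a signed score: each negative lexicon word subtracts 1,
    each positive lexicon word adds 1. More negative = grimmer tone."""
    score = 0
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")
        if word in NEGATIVE:
            score -= 1
        elif word in POSITIVE:
            score += 1
    return score

print(lexicon_sentiment("Vaccines are safe and effective but carry some risk."))  # 1
```

Under a scheme like this, a model that repeatedly frames answers around risk and complications accumulates a strongly negative total, which matches the interpretation the authors give to ChatGPT-3.5's score of -12.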

Conclusions: The study highlights varying knowledge levels of LLMs in relation to COVID-19 and pregnancy. ChatGPT-3.5 showed the least knowledge and the weakest alignment with scientific evidence. Readability and complexity analyses suggest that each AI's approach was tailored to specific audiences, with the ChatGPT versions being more suitable for specialized readers and Microsoft Copilot for the general public. Sentiment analysis revealed notable variations in how the LLMs communicated critical information, underscoring the essential role of neutral and objective healthcare communication in ensuring that pregnant women, a group particularly vulnerable during the COVID-19 pandemic, receive accurate and reassuring guidance. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provided accurate, up-to-date information on COVID-19 and vaccines in maternal and fetal health, aligning with health guidelines. The study demonstrated the potential role of AI in supplementing healthcare knowledge, along with the need for continuous updating and verification of AI knowledge bases. The choice of AI tool should consider the target audience and the required level of informational detail.

