Comparative accuracy of artificial intelligence chatbots in pulpal and periradicular diagnosis: A cross-sectional study

Computers in Biology and Medicine (Impact Factor 7.0; CAS journal ranking: Zone 2, Medicine; JCR Q1, Biology) | Pub Date: 2024-10-30 | DOI: 10.1016/j.compbiomed.2024.109332
João Daniel Mendonça de Moura, Carlos Eduardo Fontana, Vitor Henrique Reis da Silva Lima, Iris de Souza Alves, Paulo André de Melo Santos, Patrícia de Almeida Rodrigues
Volume 183, Article 109332. Full text: https://www.sciencedirect.com/science/article/pii/S0010482524014173

Abstract

Objectives

This study aimed to evaluate the diagnostic accuracy and treatment recommendation performance of four artificial intelligence chatbots in fictional pulpal and periradicular disease cases. Additionally, it investigated response consistency and the influence of text order and language on chatbot performance.

Methods

In this cross-sectional comparative study, eleven cases representing various pulpal and periradicular pathologies were created. These cases were presented to four chatbots (ChatGPT 3.5, ChatGPT 4.0, Bard, and Bing) in both Portuguese and English, with the information order varied (signs and symptoms first or imaging data first). Statistical analyses included the Kruskal-Wallis test, Dwass-Steel-Critchlow-Fligner pairwise comparisons, simple logistic regression, and the binomial test.
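The abstract names the binomial test but does not say which comparison it was applied to. As a minimal sketch only, the following shows how an exact two-sided binomial test can be computed with the Python standard library. The function names and the example counts are hypothetical illustrations, not the authors' code or data.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test.

    Sums the probabilities of every outcome that is no more likely than
    the observed one under the null success probability p.
    """
    observed = binom_pmf(k, n, p)
    tol = 1e-12  # relative tolerance so ties survive floating-point noise
    total = 0.0
    for i in range(n + 1):
        pmf = binom_pmf(i, n, p)
        if pmf <= observed * (1 + tol):
            total += pmf
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not taken from the study):
    # 9 correct answers out of 11 cases, tested against a null rate of 0.5.
    print(binom_test_two_sided(9, 11, 0.5))
```

The "no more likely than observed" rule is one common definition of the two-sided exact p-value; other definitions, such as doubling the smaller tail, can give different values when the null probability is not 0.5.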

Results

Bing and ChatGPT 4.0 achieved the highest diagnostic accuracy rates (86.4% and 85.3%, respectively), significantly outperforming ChatGPT 3.5 (46.5%) and Bard (28.6%) (p < 0.001). For treatment recommendations, ChatGPT 4.0, Bing, and ChatGPT 3.5 performed similarly (94.4%, 93.2%, and 86.3%, respectively), while Bard exhibited significantly lower accuracy (75%, p < 0.001). No significant association between diagnosis and treatment accuracy was found for Bard and Bing, but a positive association was observed for ChatGPT 3.5 and ChatGPT 4.0 (p < 0.05). The overall consistency rate was 98.29%, with no significant differences related to text order or language. Cases presented in Portuguese prompted significantly more requests for additional information than those in English (33.5% vs. 10.2%; p < 0.001), and the relevance of this information was also higher in Portuguese (29.5% vs. 8.5%; p < 0.001).

Conclusions

Bing and ChatGPT 4.0 demonstrated superior diagnostic accuracy, while Bard showed the lowest accuracy in both diagnosis and treatment recommendations. However, the clinical application of these tools necessitates critical interpretation by dentists, as chatbot responses are not consistently reliable.