Comparing answers of artificial intelligence systems and clinical toxicologists to questions about poisoning: Can their answers be distinguished?

Santiago Nogué-Xarau, José Ríos-Guillermo, Montserrat Amigó-Tadín
{"title":"Comparing answers of artificial intelligence systems and clinical toxicologists to questions about poisoning: Can their answers be distinguished?","authors":"Santiago Nogué-Xarau, José Ríos-Guillermo, Montserrat Amigó-Tadín","doi":"10.55633/s3me/082.2024","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>To present questions about poisoning to 4 artificial intelligence (AI) systems and 4 clinical toxicologists and determine whether readers can identify the source of the answers. To evaluate and compare text quality and level of knowledge found in the AI and toxicologists' responses.</p><p><strong>Methods: </strong>Ten questions about toxicology were presented to the following AI systems: Copilot, Bard, Luzia, and ChatGPT. Four clinical toxicologists were asked to answer the same questions. Twenty-four recruited experts in toxicology were sent a pair of answers (1 from an AI system and one from a toxicologist) for each of the 10 questions. For each answer, the experts had to identify the source, evaluate text quality, and assess level of knowledge reflected. Quantitative variables were described as mean (SD) and qualitative ones as absolute frequency and proportion. A value of P .05 was considered significant in all comparisons.</p><p><strong>Results: </strong>Of the 240 evaluated AI answers, the expert evaluators thought that 21 (8.8%) and 38 (15.8%), respectively, were certainly or probably written by a toxicologist. The experts were unable to guess the source of 13 (5.4%) AI answers. Luzia and ChatGPT were better able to mislead the experts than Bard (P = .036 and P = .041, respectively). Text quality was judged excellent in 38.8% of the AI answers. ChatGPT text quality was rated highest (61.3% excellent) vs Bard (34.4%), Luzia (31.7%), and Copilot (26.3%) (P .001, all comparisons). The average score for the level of knowledge perceived in the AI answers was 7.23 (1.57) out of 10. The highest average score was achieved by ChatGPT at 8.03 (1.26) vs Luzia (7.02 [1,63]), Bard (6.91 [1.64]), and Copilot (6.91 [1.46]) (P .001, all comparisons).</p><p><strong>Conclusions: </strong>Luzia and ChatGPT answers to the toxicology questions were often thought to resemble those of clinical toxicologists. ChatGPT answers were judged to be very well-written and reflect a very high level of knowledge.</p>","PeriodicalId":93987,"journal":{"name":"Emergencias : revista de la Sociedad Espanola de Medicina de Emergencias","volume":"36 5","pages":"351-358"},"PeriodicalIF":0.0000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Emergencias : revista de la Sociedad Espanola de Medicina de Emergencias","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.55633/s3me/082.2024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Objective: To present questions about poisoning to 4 artificial intelligence (AI) systems and 4 clinical toxicologists and determine whether readers can identify the source of the answers. To evaluate and compare text quality and level of knowledge found in the AI and toxicologists' responses.

Methods: Ten questions about toxicology were presented to the following AI systems: Copilot, Bard, Luzia, and ChatGPT. Four clinical toxicologists were asked to answer the same questions. Twenty-four recruited experts in toxicology were sent a pair of answers (one from an AI system and one from a toxicologist) for each of the 10 questions. For each answer, the experts had to identify the source, evaluate text quality, and assess the level of knowledge reflected. Quantitative variables were described as mean (SD) and qualitative ones as absolute frequency and proportion. A value of P < .05 was considered significant in all comparisons.
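The abstract does not name the statistical tests behind these comparisons; as a minimal sketch only, the descriptive summaries (mean [SD], absolute frequency and proportion) and a pairwise comparison of proportions could be computed along the following lines, with Fisher's exact test and all example numbers assumed for illustration rather than taken from the study.

```python
# Minimal sketch of the descriptive statistics and a pairwise comparison.
# The abstract does not specify the tests; Fisher's exact test and the
# numbers below are assumptions for illustration only.
import numpy as np
from scipy import stats

# Hypothetical knowledge scores (0-10) awarded by experts to one AI system
scores = np.array([8, 7, 9, 6, 8, 7, 10, 7, 8, 6], dtype=float)
print(f"mean (SD): {scores.mean():.2f} ({scores.std(ddof=1):.2f})")

# Hypothetical 2x2 table: answers rated "excellent" vs "not excellent"
# for two AI systems, compared pairwise
table = [[37, 23],   # system A: excellent, not excellent
         [16, 44]]   # system B: excellent, not excellent
_, p_value = stats.fisher_exact(table)
print(f"P = {p_value:.3f} (significant if P < .05)")
```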

Results: Of the 240 evaluated AI answers, the expert evaluators rated 21 (8.8%) as certainly and 38 (15.8%) as probably written by a toxicologist. The experts were unable to guess the source of 13 (5.4%) AI answers. Luzia and ChatGPT were better able to mislead the experts than Bard (P = .036 and P = .041, respectively). Text quality was judged excellent in 38.8% of the AI answers. ChatGPT text quality was rated highest (61.3% excellent) vs Bard (34.4%), Luzia (31.7%), and Copilot (26.3%) (P < .001, all comparisons). The average score for the level of knowledge perceived in the AI answers was 7.23 (1.57) out of 10. The highest average score was achieved by ChatGPT at 8.03 (1.26) vs Luzia (7.02 [1.63]), Bard (6.91 [1.64]), and Copilot (6.91 [1.46]) (P < .001, all comparisons).
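As a quick arithmetic check, the denominator of 240 is consistent with 24 experts each evaluating the AI answer in 10 pairs, and the reported percentages follow directly:

```python
# Verifying the proportions reported for the 240 evaluated AI answers
total = 24 * 10  # 24 experts, one AI answer per pair, 10 pairs each = 240
for label, n in [("certainly written by a toxicologist", 21),
                 ("probably written by a toxicologist", 38),
                 ("source could not be guessed", 13)]:
    print(f"{label}: {n}/{total} = {100 * n / total:.1f}%")
# prints 8.8%, 15.8%, and 5.4%, matching the abstract
```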

Conclusions: Luzia and ChatGPT answers to the toxicology questions were often thought to resemble those of clinical toxicologists. ChatGPT answers were judged to be very well written and to reflect a very high level of knowledge.
