Performance of artificial intelligence chatbot as a source of patient information on anti-rheumatic drug use in pregnancy

Nurdan Oruçoğlu, Elif Altunel Kılınç
DOI: 10.28982/josam.7977 (https://doi.org/10.28982/josam.7977)
Journal: International Journal of Surgery and Medicine
Published: 2023-10-04
Publication type: Journal Article

Abstract

Background/Aim: Women with rheumatic and musculoskeletal disorders often discontinue their medications before conception or during the first few weeks of pregnancy because drug use during pregnancy frequently causes anxiety. Pregnant women report seeking health-related information from a variety of sources, particularly the Internet, to ease their concerns about using such medications during pregnancy. This study aimed to evaluate the accuracy and completeness of health-related information on the use of anti-rheumatic medications during pregnancy as provided by OpenAI's Chat Generative Pre-trained Transformer (ChatGPT) versions 3.5 and 4, which are widely known artificial intelligence (AI) tools.

Methods: In this prospective cross-sectional study, the performance of OpenAI's ChatGPT versions 3.5 and 4 in providing health information on anti-rheumatic drugs during pregnancy was assessed using the 2016 European Alliance of Associations for Rheumatology (EULAR) guidelines as the reference. Fourteen queries derived from the guidelines were entered into both AI models. Two evaluators independently rated the responses for accuracy using a predefined 6-point Likert-like scale (1 – completely incorrect to 6 – completely correct) and for completeness using a 3-point Likert-like scale (1 – incomplete to 3 – complete). Inter-rater reliability was evaluated using Cohen’s kappa statistic, and differences in scores between the ChatGPT versions were compared using the Mann–Whitney U test.

Results: There was no statistically significant difference between the mean accuracy scores of GPT versions 3.5 and 4 (5 [1.17] versus 5.07 [1.26]; P=0.769), indicating that the scores for both models fell between nearly all accurate and correct. Likewise, there was no statistically significant difference between the mean completeness scores of GPT 3.5 and GPT 4 (2.5 [0.51] versus 2.64 [0.49]; P=0.541), indicating scores between adequate and comprehensive for both models. The two models also had similar total mean accuracy and completeness scores (3.75 [1.55] versus 3.86 [1.57]; P=0.717). With GPT 3.5, hydroxychloroquine and leflunomide received full scores for both accuracy and completeness, while with GPT 4, methotrexate, sulfasalazine, cyclophosphamide, mycophenolate mofetil, and tofacitinib received the highest total scores. Nevertheless, for both models, one of the 14 drugs was scored as more incorrect than correct.

Conclusions: On the safety and compatibility of anti-rheumatic medications during pregnancy, both ChatGPT versions 3.5 and 4 demonstrated satisfactory accuracy and completeness. However, the responses generated by ChatGPT also contained inaccurate information. Despite its good performance, ChatGPT should not be used as a standalone tool for making decisions about taking medications during pregnancy because of its limitations.
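
The two statistics named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The ratings below are made-up placeholders, not the study's data, and the function names are hypothetical. The sketch assumes unweighted Cohen's kappa and a two-sided Mann–Whitney P value from the normal approximation with tie correction and no continuity correction; the abstract does not state which variants were used.

```python
import math
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(freq_a[k] * freq_b[k] for k in freq_a) / n ** 2
    if p_exp == 1:  # both raters used one identical category throughout
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided P value by normal approximation)."""
    pooled = sorted(list(x) + list(y))
    n1, n2, n = len(x), len(y), len(pooled)
    ranks, tie_term = {}, 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                             # size of this tie group
        ranks[pooled[i]] = (i + 1 + j) / 2    # average of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    u1 = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every observation identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical 6-point accuracy ratings for 14 queries (NOT study data).
    evaluator_1 = [6, 5, 6, 4, 5, 6, 3, 5, 6, 6, 2, 5, 6, 5]
    evaluator_2 = [6, 5, 5, 4, 5, 6, 3, 5, 6, 6, 3, 5, 6, 5]
    print("kappa:", round(cohen_kappa(evaluator_1, evaluator_2), 3))

    # Hypothetical per-query scores for two model versions (NOT study data).
    model_a = [6, 5, 6, 4, 5, 6, 3, 5, 6, 6, 2, 5, 6, 5]
    model_b = [6, 6, 5, 5, 5, 6, 2, 6, 6, 5, 4, 5, 6, 4]
    u, p = mann_whitney_u(model_a, model_b)
    print("U:", u, "P:", round(p, 3))
```

With only 14 heavily tied ordinal scores per group, the normal approximation is rough; statistical packages may report an exact or continuity-corrected P value instead, so results can differ slightly from this sketch.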