Assessing bias in AI-driven psychiatric recommendations: A comparative cross-sectional study of chatbot-classified and CANMAT 2023 guideline for adjunctive therapy in difficult-to-treat depression

IF 3.9 | Zone 2, Medicine | JCR Q1, Psychiatry | Psychiatry Research | Pub Date: 2025-06-01 | Epub Date: 2025-04-15 | DOI: 10.1016/j.psychres.2025.116501

Yu Chang , Yi-Chun Liu , Si-Sheng Huang , Wen-Yu Hsu

Psychiatry Research, Volume 348, Article 116501. Publication type: Journal Article. Open access: no. Source: https://www.sciencedirect.com/science/article/pii/S0165178125001490

Citations: 0

Abstract

The integration of chatbots into psychiatry introduces a novel approach to support clinical decision-making, but biases in their recommendations pose significant concerns. This study investigates potential biases in chatbot-generated recommendations for adjunctive therapy in difficult-to-treat depression, comparing these outputs with the Canadian Network for Mood and Anxiety Treatments (CANMAT) 2023 guidelines. The analysis involved calculating Cohen’s kappa coefficients to measure the overall level of agreement between chatbot-generated classifications and CANMAT guidelines. Differences between chatbot-generated and CANMAT classifications for each medication were assessed using the Wilcoxon signed-rank test. Results reveal substantial agreement for high-performing models, such as Google AI's Gemini 2.0 Flash, which achieved the highest Cohen’s kappa value of 0.82 (SE = 0.052). In contrast, OpenAI’s o1 model showed a lower agreement of 0.746 (SE = 0.057). Notable discrepancies were observed in the overestimation of medications such as quetiapine and lithium and the underestimation of modafinil and ketamine. Additionally, a distinct bias pattern was observed in OpenAI’s chatbots, which demonstrated a tendency to over-recommend lithium and bupropion. Our study highlights both the promise and the challenges of employing AI tools in psychiatric practice, and advocates for multi-model approaches to mitigate bias and improve clinical reliability.
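To make the agreement statistic concrete, the following is a minimal sketch of Cohen's kappa, defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The abstract does not say whether the authors used a weighted or unweighted kappa, so the unweighted form here is an assumption. The function name `cohens_kappa` and the toy labels are hypothetical illustrations and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two marginal frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data (NOT from the study):
# 1 = first-line, 2 = second-line, 3 = third-line recommendation.
guideline = [1, 1, 2, 2, 2, 3, 3, 3]
chatbot = [1, 1, 2, 2, 1, 3, 3, 2]

kappa = cohens_kappa(guideline, chatbot)
print(f"kappa = {kappa:.3f}")
```

Because recommendation lines are ordinal, a weighted kappa that penalizes a first-line versus third-line mismatch more heavily than an adjacent-line mismatch may be the more natural choice in practice; the unweighted version is shown only because it is the simplest correct instance of the statistic.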




Source journal: Psychiatry Research (Medicine: Psychiatry)

CiteScore: 17.40
Self-citation rate: 1.80%
Publication volume: 527
Review time: 57 days

Journal description: Psychiatry Research offers swift publication of comprehensive research reports and reviews within the field of psychiatry. The scope of the journal encompasses: Biochemical, physiological, neuroanatomic, genetic, neurocognitive, and psychosocial determinants of psychiatric disorders. Diagnostic assessments of psychiatric disorders. Evaluations that pursue hypotheses about the cause or causes of psychiatric diseases. Evaluations of pharmacologic and non-pharmacologic psychiatric treatments. Basic neuroscience studies related to animal or neurochemical models for psychiatric disorders. Methodological advances, such as instrumentation, clinical scales, and assays directly applicable to psychiatric research.