Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study.

IF 4.8 · CAS Zone 2 (Medicine) · Q1 (Psychiatry) · JMIR Mental Health · Pub Date: 2024-07-23 · DOI: 10.2196/57306
Prottay Kumar Adhikary, Aseem Srivastava, Shivani Kumar, Salam Michael Singh, Puneet Manuja, Jini K Gopinath, Vijay Krishnan, Swati Kedia Gupta, Koushik Sinha Deb, Tanmoy Chakraborty
{"title":"Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study.","authors":"Prottay Kumar Adhikary, Aseem Srivastava, Shivani Kumar, Salam Michael Singh, Puneet Manuja, Jini K Gopinath, Vijay Krishnan, Swati Kedia Gupta, Koushik Sinha Deb, Tanmoy Chakraborty","doi":"10.2196/57306","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Comprehensive session summaries enable effective continuity in mental health counseling, facilitating informed therapy planning. However, manual summarization presents a significant challenge, diverting experts' attention from the core counseling process. Leveraging advances in automatic summarization to streamline the summarization process addresses this issue because this enables mental health professionals to access concise summaries of lengthy therapy sessions, thereby increasing their efficiency. However, existing approaches often overlook the nuanced intricacies inherent in counseling interactions.</p><p><strong>Objective: </strong>This study evaluates the effectiveness of state-of-the-art large language models (LLMs) in selectively summarizing various components of therapy sessions through aspect-based summarization, aiming to benchmark their performance.</p><p><strong>Methods: </strong>We first created Mental Health Counseling-Component-Guided Dialogue Summaries, a benchmarking data set that consists of 191 counseling sessions with summaries focused on 3 distinct counseling components (also known as counseling aspects). Next, we assessed the capabilities of 11 state-of-the-art LLMs in addressing the task of counseling-component-guided summarization. The generated summaries were evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals.</p><p><strong>Results: </strong>Our findings demonstrated the superior performance of task-specific LLMs such as MentalLlama, Mistral, and MentalBART evaluated using standard quantitative metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, ROUGE-2, ROUGE-L, and Bidirectional Encoder Representations from Transformers Score across all aspects of the counseling components. Furthermore, expert evaluation revealed that Mistral superseded both MentalLlama and MentalBART across 6 parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, these models exhibit a common weakness in terms of room for improvement in the opportunity costs and perceived effectiveness metrics.</p><p><strong>Conclusions: </strong>While LLMs fine-tuned specifically on mental health domain data display better performance based on automatic evaluation scores, expert assessments indicate that these models are not yet reliable for clinical application. 
Further refinement and validation are necessary before their implementation in practice.</p>","PeriodicalId":48616,"journal":{"name":"Jmir Mental Health","volume":"11 ","pages":"e57306"},"PeriodicalIF":4.8000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11303879/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Jmir Mental Health","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/57306","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHIATRY","Score":null,"Total":0}
Citations: 0

Abstract

Background: Comprehensive session summaries enable effective continuity in mental health counseling and facilitate informed therapy planning. However, manual summarization is a significant burden, diverting experts' attention from the core counseling process. Advances in automatic summarization can streamline this process, giving mental health professionals concise summaries of lengthy therapy sessions and thereby increasing their efficiency. However, existing approaches often overlook the nuanced intricacies inherent in counseling interactions.

Objective: This study evaluates the effectiveness of state-of-the-art large language models (LLMs) in selectively summarizing various components of therapy sessions through aspect-based summarization, aiming to benchmark their performance.

Methods: We first created Mental Health Counseling-Component-Guided Dialogue Summaries, a benchmarking data set that consists of 191 counseling sessions with summaries focused on 3 distinct counseling components (also known as counseling aspects). Next, we assessed the capabilities of 11 state-of-the-art LLMs in addressing the task of counseling-component-guided summarization. The generated summaries were evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals.
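
To make the task concrete, the sketch below shows one plausible way to frame counseling-component-guided summarization as aspect-conditioned prompting of an instruction-tuned LLM. The component names, prompt wording, and the `llm_generate` callable are illustrative assumptions; the abstract does not name the 3 components or specify the prompts used.

```python
# A minimal sketch of aspect-based (counseling-component-guided)
# summarization. The component names and prompt template are
# hypothetical; the study's 3 components are not named in the abstract.

# Hypothetical counseling components, each mapped to a focus description.
COMPONENTS = {
    "symptoms": "the client's reported symptoms and relevant history",
    "interventions": "the techniques and interventions the counselor applied",
    "planning": "agreed next steps and plans for future sessions",
}

def build_prompt(transcript: str, component: str) -> str:
    """Compose an aspect-guided summarization prompt for one component."""
    focus = COMPONENTS[component]
    return (
        "Summarize the following counseling session, focusing only on "
        f"{focus}. Ignore content unrelated to this aspect.\n\n"
        f"Session transcript:\n{transcript}\n\nSummary:"
    )

def summarize_session(transcript: str, llm_generate) -> dict:
    """Produce one summary per counseling component.

    `llm_generate` is any callable mapping a prompt string to generated
    text (a wrapper around whichever of the 11 benchmarked LLMs is used).
    """
    return {c: llm_generate(build_prompt(transcript, c)) for c in COMPONENTS}
```

Keeping the prompt fixed and swapping only the model behind `llm_generate` is what makes a comparison across 11 LLMs a controlled benchmark.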

Results: Our findings demonstrated the superior performance of task-specific LLMs such as MentalLlama, Mistral, and MentalBART, evaluated with standard quantitative metrics, namely Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, ROUGE-2, ROUGE-L, and Bidirectional Encoder Representations from Transformers Score (BERTScore), across all counseling components. Furthermore, expert evaluation revealed that Mistral outperformed both MentalLlama and MentalBART on 6 parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, all 3 models shared a common weakness, leaving clear room for improvement on the opportunity costs and perceived effectiveness parameters.
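
For readers who want to reproduce the quantitative side of such an evaluation, the sketch below computes ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore with the open-source `rouge-score` and `bert-score` Python packages. These are plausible tooling choices, not necessarily the ones the authors used.

```python
# Scoring generated summaries against references with ROUGE and BERTScore.
# Assumes: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(candidates, references):
    """Return mean ROUGE-1/2/L F1 and BERTScore F1 over paired summaries."""
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for cand, ref in zip(candidates, references):
        scores = scorer.score(ref, cand)  # signature is score(target, prediction)
        for key in totals:
            totals[key] += scores[key].fmeasure
    results = {k: v / len(candidates) for k, v in totals.items()}

    # BERTScore compares contextual embeddings rather than n-grams;
    # the F1 component is the figure usually reported.
    _, _, f1 = bert_score(candidates, references, lang="en")
    results["bertscore_f1"] = f1.mean().item()
    return results
```

Because ROUGE rewards n-gram overlap while BERTScore rewards semantic similarity, reporting both guards against models that copy surface wording without preserving meaning, a real risk when summarizing emotionally nuanced dialogue.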

Conclusions: While LLMs fine-tuned specifically on mental health domain data score better on automatic evaluation metrics, expert assessments indicate that these models are not yet reliable for clinical application. Further refinement and validation are necessary before they are implemented in practice.

Source journal: JMIR Mental Health (Medicine-Psychiatry and Mental Health)
CiteScore: 10.80
Self-citation rate: 3.80%
Annual publications: 104
Review time: 16 weeks
Journal description: JMIR Mental Health (JMH, ISSN 2368-7959) is a PubMed-indexed, peer-reviewed sister journal of JMIR, the leading eHealth journal (Impact Factor 2016: 5.175). JMIR Mental Health focuses on digital health and internet interventions, technologies, and electronic innovations (software and hardware) for mental health, addictions, online counseling, and behavior change. This includes formative evaluations and system descriptions, theoretical papers, review papers, viewpoint/vision papers, and rigorous evaluations.