Towards Multimodal Emotional Support Conversation Systems
Yuqi Chu, Lizi Liao, Zhiyuan Zhou, Chong-Wah Ngo, Richang Hong
arXiv:2408.03650 · arXiv - CS - Multimedia · Published 2024-08-07
Abstract
The integration of conversational artificial intelligence (AI) into mental health care promises a new horizon for therapist-client interactions, aiming to closely emulate the depth and nuance of human conversations. Despite this potential, the current landscape of conversational AI is markedly limited by its reliance on single-modal data, constraining systems' ability to empathize and provide effective emotional support. This limitation stems from a paucity of resources that capture the multimodal nature of human communication essential to therapeutic counseling. To address this gap, we introduce the Multimodal Emotional Support Conversation (MESC) dataset, a first-of-its-kind resource with comprehensive annotations across text, audio, and video modalities. The dataset captures the intricate interplay of user emotions, system strategies, system emotions, and system responses, setting a new precedent in the field. Leveraging MESC, we propose a general Sequential Multimodal Emotional Support (SMES) framework grounded in Therapeutic Skills Theory. Tailored for multimodal dialogue systems, SMES incorporates an LLM-based reasoning model that sequentially performs user emotion recognition, system strategy prediction, system emotion prediction, and response generation. Rigorous evaluations demonstrate that the framework significantly enhances AI systems' ability to mimic therapist behaviors with heightened empathy and strategic responsiveness. By integrating multimodal data in this way, we bridge the critical gap between emotion recognition and emotional support, marking a significant advance in conversational AI for mental health support.
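To make the described pipeline concrete, the sketch below illustrates the four-step sequential reasoning chain the abstract attributes to SMES (user emotion recognition, strategy prediction, system emotion prediction, response generation). It is a minimal, hypothetical Python rendering: the data structure, prompt wording, and names such as `MultimodalTurn` and `run_smes_chain` are illustrative assumptions, not the paper's actual implementation or prompts, and a stub stands in for the LLM so the example runs end to end.

```python
# Hypothetical sketch of the SMES-style sequential chain described in the
# abstract. Names, prompts, and labels here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MultimodalTurn:
    """One annotated turn: transcript plus per-modality emotion cues."""
    speaker: str                       # "user" or "system"
    text: str                          # utterance transcript
    audio_cues: List[str] = field(default_factory=list)   # e.g. "flat tone"
    visual_cues: List[str] = field(default_factory=list)  # e.g. "averted gaze"


def run_smes_chain(history: List[MultimodalTurn],
                   llm: Callable[[str], str]) -> dict:
    """Generate the four outputs in order, feeding each result forward."""
    context = "\n".join(
        f"{t.speaker}: {t.text} "
        f"[audio: {', '.join(t.audio_cues) or 'none'}; "
        f"visual: {', '.join(t.visual_cues) or 'none'}]"
        for t in history
    )
    user_emotion = llm(f"Dialogue:\n{context}\nIdentify the user's emotion:")
    strategy = llm(f"Dialogue:\n{context}\nUser emotion: {user_emotion}\n"
                   "Choose a support strategy:")
    system_emotion = llm(f"User emotion: {user_emotion}\nStrategy: {strategy}\n"
                         "Choose the emotion the system should convey:")
    response = llm(f"Dialogue:\n{context}\nStrategy: {strategy}\n"
                   f"System emotion: {system_emotion}\nWrite the reply:")
    return {"user_emotion": user_emotion, "strategy": strategy,
            "system_emotion": system_emotion, "response": response}


if __name__ == "__main__":
    # Stub standing in for a real LLM, returning canned outputs in order.
    canned = iter(["anxiety", "reflection of feelings", "warmth",
                   "It sounds like this has been weighing on you heavily."])
    demo_llm = lambda prompt: next(canned)
    turn = MultimodalTurn("user", "I can't sleep before the review.",
                          audio_cues=["trembling voice"],
                          visual_cues=["slumped posture"])
    print(run_smes_chain([turn], demo_llm))
```

The key design point the abstract emphasizes is the conditioning order: each prediction is appended to the prompt for the next step, so the final response is generated with the recognized user emotion, chosen strategy, and intended system emotion already fixed, rather than produced in a single unconditioned pass.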