FPT-Former: A Flexible Parallel Transformer of Recognizing Depression by Using Audiovisual Expert-Knowledge-Based Multimodal Measures

IF 5; Zone 2, Computer Science; Q1, Computer Science, Artificial Intelligence; International Journal of Intelligent Systems; Pub Date: 2024-01-29; DOI: 10.1155/2024/1564574

Yifu Li, Xueping Yang, Meng Zhao, Zihao Wang, Yudong Yao, Wei Qian, Shouliang Qi


Citations: 0


FPT-Former: A Flexible Parallel Transformer of Recognizing Depression by Using Audiovisual Expert-Knowledge-Based Multimodal Measures

Background and Objective. Depression is a widespread global problem that imposes a significant burden and disability on individuals, families, and society. Deep learning (DL) has emerged as a valuable approach for detecting depression automatically by extracting cues from audiovisual data. The PHQ-8 is considered a validated diagnostic tool for depressive disorders in clinical studies, and the objective of this study is to improve the accuracy of PHQ-8 score prediction. The paper also aims to demonstrate the effectiveness of expert knowledge in depression diagnosis and to discuss a novel multimodal network architecture.

Methods. This paper addresses multimodal depression analysis and proposes a flexible parallel transformer (FPT) model that takes input from three distinct modalities (i.e., one video and two audio descriptors). The FPT-Former model has three parallel paths, each taking the expert-knowledge-based descriptors of one modality as input. The encoder part of a transformer module encodes these descriptors into 32 features, which are then fused to produce the final regression of the PHQ-8 score. The Extended Distress Analysis Interview Corpus (E-DAIC), an expansion of WOZ-DAIC, comprises semiclinical interviews intended to support the diagnosis of psychological distress conditions. It includes 275 participants and was used in this study to evaluate the model with 10-fold cross-validation.

Results. The FPT achieved performance comparable to state-of-the-art work, with a root mean square error (RMSE) of 4.80 and a mean absolute error (MAE) of 4.58. Ablation experiments show that the model fusing all three modalities outperforms the two-modality and single-modality variants. With a PHQ-8 score threshold of 10, the accuracy of depression classification is 0.79.

Conclusions. By combining expert-knowledge-based multimodal measures with a parallel transformer structure, the FPT model shows promising performance in depression detection. The model improved the accuracy of depression diagnosis from audio and video and demonstrated the effectiveness of expert knowledge in diagnosing depression. Its flexible structure, high predictive efficiency, and privacy protection make it a candidate intelligent system for wider use in mental healthcare.
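The evaluation quantities named in the abstract (RMSE, MAE, accuracy at a PHQ-8 score threshold of 10, and a 10-fold split over 275 participants) can be illustrated with a minimal sketch. This is not the authors' code. The function names are hypothetical, the scores are made up, and the folds are contiguous and unshuffled. Treating a score at or above the threshold as the positive class is also an assumption, since the abstract does not state how the threshold is applied or how the folds were drawn.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted PHQ-8 scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted PHQ-8 scores."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


def threshold_accuracy(y_true, y_pred, threshold=10):
    """Accuracy after binarizing both score lists at `threshold`.

    Assumption: a score >= threshold is the positive (depressed) class.
    """
    hits = sum((t >= threshold) == (p >= threshold)
               for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous (train, test) index folds.

    Assumption: no shuffling or stratification; fold sizes differ by at most 1.
    """
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    true_scores = [2, 12, 7, 15]
    predicted = [4, 9, 7, 16]
    print("RMSE:", rmse(true_scores, predicted))
    print("MAE:", mae(true_scores, predicted))
    print("Accuracy at threshold 10:", threshold_accuracy(true_scores, predicted))
    print("Test-fold sizes:", [len(test) for _, test in k_fold_indices(275, 10)])
```

In a cross-validated setting, the regression model would be fit on each `train` index list and scored on the matching `test` list, with the metrics then aggregated across folds. How the paper aggregates them is not stated in the abstract.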


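Source journal

International Journal of Intelligent Systems (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 14.30%

Articles published: 304

Review time: 9 months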

About the journal: The International Journal of Intelligent Systems serves as a forum for individuals interested in tapping into the vast theories based on intelligent systems construction. With its peer-reviewed format, the journal explores several fascinating editorials written by today's experts in the field. Because new developments are being introduced each day, there's much to be learned: examination, analysis creation, information retrieval, man–computer interactions, and more. The International Journal of Intelligent Systems uses charts and illustrations to demonstrate these ground-breaking issues, and encourages readers to share their thoughts and experiences.