Minghui Zhao;Hongxiang Gao;Lulu Zhao;Zhongyu Wang;Fei Wang;Wenming Zheng;Jianqing Li;Chengyu Liu
IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1772-1786. Published 2025-02-04. DOI: 10.1109/TAFFC.2025.3538519. Available: https://ieeexplore.ieee.org/document/10872825/
Decoupled Multi-Perspective Fusion for Speech Depression Detection
Speech Depression Detection (SDD) has garnered attention from researchers due to its low cost and convenience. However, current algorithms lack methods for extracting interpretable acoustic features based on clinical manifestations. In addition, effectively fusing these features to overcome individual heterogeneity remains a challenge. This study proposes a decoupled multi-perspective fusion (DMPF) model. The model extracts five key features (voiceprint, emotion, pause, energy, and tremor) based on multi-perspective clinical manifestations. These features are then decoupled into common and private features, which are fused through a graph attention network to obtain a comprehensive depression representation. Notably, this study has collected a depression speech dataset, which includes standardized and comprehensive tasks along with diagnostic labels provided by psychologists. Extensive subject-independent experiments were conducted on the DAIC-WOZ, MODMA, and MPSC datasets. The voiceprint features can automatically cluster the depressed and non-depressed populations. Furthermore, DMPF can effectively fuse common and private features from different perspectives, achieving AUCs of 84.20%, 85.34%, and 86.13% on the three datasets. The results illustrate the interpretability of the multi-perspective features and demonstrate that combining speech manifestations can improve detection performance, providing a multi-perspective observational tool for physicians and clinical practice.
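The abstract describes decoupling each perspective into common and private features and fusing them with a graph attention network, but gives no implementation details. The following is only a minimal sketch of that general idea, not the authors' DMPF model. It assumes one shared projection for the common parts, one projection per perspective for the private parts, a single-head attention layer over a fully connected graph of the ten resulting nodes, and mean pooling as the readout. All function names, dimensions, and the random (untrained) weights are hypothetical.

```python
import math
import random

PERSPECTIVES = ["voiceprint", "emotion", "pause", "energy", "tremor"]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def decouple(features, W_common, W_private):
    """Assumption: one shared projection yields the 'common' parts and
    one projection per perspective yields the 'private' parts."""
    common = {n: matvec(W_common, x) for n, x in features.items()}
    private = {n: matvec(W_private[n], x) for n, x in features.items()}
    return common, private


def graph_attention(nodes, a):
    """Single-head attention over a fully connected graph.
    Score for (i, j) is LeakyReLU(a . [h_i || h_j]), softmax-normalised over j."""
    outputs, weights = [], []
    for h_i in nodes:
        scores = [
            leaky_relu(sum(w * v for w, v in zip(a, h_i + h_j)))
            for h_j in nodes
        ]
        alpha = softmax(scores)
        weights.append(alpha)
        outputs.append([
            sum(al * h_j[k] for al, h_j in zip(alpha, nodes))
            for k in range(len(h_i))
        ])
    return outputs, weights


def fuse(features, W_common, W_private, a):
    """Decouple, attend over all common and private nodes, then mean-pool."""
    common, private = decouple(features, W_common, W_private)
    nodes = [common[n] for n in PERSPECTIVES] + [private[n] for n in PERSPECTIVES]
    attended, weights = graph_attention(nodes, a)
    dim = len(attended[0])
    fused = [sum(node[k] for node in attended) / len(attended) for k in range(dim)]
    return fused, weights


# Toy inputs: one random 8-dimensional vector per perspective (hypothetical sizes).
rng = random.Random(0)
in_dim, hid_dim = 8, 4
features = {n: [rng.gauss(0, 1) for _ in range(in_dim)] for n in PERSPECTIVES}
W_common = random_matrix(hid_dim, in_dim, rng)
W_private = {n: random_matrix(hid_dim, in_dim, rng) for n in PERSPECTIVES}
a = [rng.uniform(-0.5, 0.5) for _ in range(2 * hid_dim)]

fused, weights = fuse(features, W_common, W_private, a)
print(len(fused), len(weights))
```

In this sketch, each row of `weights` shows how strongly one node attends to the others, which is one way such a fusion step could be inspected. A trained model would additionally need loss terms that actually push the common and private parts apart; those are not specified in the abstract and are omitted here.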
About the journal:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.