Jinhan Wang , Vijay Ravi , Jonathan Flint , Abeer Alwan
{"title":"Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification","authors":"Jinhan Wang , Vijay Ravi , Jonathan Flint , Abeer Alwan","doi":"10.1016/j.specom.2024.103106","DOIUrl":null,"url":null,"abstract":"<div><p>Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level) which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function without the necessity of input–output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated in objectives on various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using HuBERT policy for label generation achieves 83.15% F1-score, which is close to state-of-the-art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and the proposed framework can be potentially used to determine significant depressive regions in speech utterances.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103106"},"PeriodicalIF":2.4000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000785/pdfft?md5=afe02da612b1e415b45579997ae4074e&pid=1-s2.0-S0167639324000785-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000785","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0
Abstract
Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level) which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function without the necessity of input–output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated in objectives on various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using HuBERT policy for label generation achieves 83.15% F1-score, which is close to state-of-the-art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and the proposed framework can be potentially used to determine significant depressive regions in speech utterances.
期刊介绍:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal''s primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.