Multimodal depression detection based on an attention graph convolution and transformer.

IF 2.6 4区工程技术 Q1 Mathematics Mathematical Biosciences and Engineering Pub Date : 2025-02-27 DOI:10.3934/mbe.2025024

Xiaowen Jia, Jingxia Chen, Kexin Liu, Qian Wang, Jialing He

{"title":"Multimodal depression detection based on an attention graph convolution and transformer.","authors":"Xiaowen Jia, Jingxia Chen, Kexin Liu, Qian Wang, Jialing He","doi":"10.3934/mbe.2025024","DOIUrl":null,"url":null,"abstract":"<p><p>Traditional depression detection methods typically rely on single-modal data, but these approaches are limited by individual differences, noise interference, and emotional fluctuations. To address the low accuracy in single-modal depression detection and the poor fusion of multimodal features from electroencephalogram (EEG) and speech signals, we have proposed a multimodal depression detection model based on EEG and speech signals, named the multi-head attention-GCN_ViT (MHA-GCN_ViT). This approach leverages deep learning techniques, including graph convolutional networks (GCN) and vision transformers (ViT), to effectively extract and fuse the frequency-domain features and spatiotemporal characteristics of EEG signals with the frequency-domain features of speech signals. First, a discrete wavelet transform (DWT) was used to extract wavelet features from 29 channels of EEG signals. These features serve as node attributes for the construction of a feature matrix, calculating the Pearson correlation coefficient between channels, from which an adjacency matrix is constructed to represent the brain network structure. This structure was then fed into a graph convolutional network (GCN) for deep feature learning. A multi-head attention mechanism was introduced to enhance the GCN's capability in representing brain networks. Using a short-time Fourier transform (STFT), we extracted 2D spectral features of EEG signals and mel spectrogram features of speech signals. Both were further processed using a vision transformer (ViT) to obtain deep features. Finally, the multiple features from EEG and speech spectrograms were fused at the decision level for depression classification. A five-fold cross-validation on the MODMA dataset demonstrated the model's accuracy, precision, recall, and F1 score of 89.03%, 90.16%, 89.04%, and 88.83%, respectively, indicating a significant improvement in the performance of multimodal depression detection. Furthermore, MHA-GCN_ViT demonstrated robust performance in depression detection and exhibited broad applicability, with potential for extension to multimodal detection tasks in other psychological and neurological disorders.</p>","PeriodicalId":49870,"journal":{"name":"Mathematical Biosciences and Engineering","volume":"22 3","pages":"652-676"},"PeriodicalIF":2.6000,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Mathematical Biosciences and Engineering","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.3934/mbe.2025024","RegionNum":4,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Mathematics","Score":null,"Total":0}

引用次数: 0

Abstract

Traditional depression detection methods typically rely on single-modal data, but these approaches are limited by individual differences, noise interference, and emotional fluctuations. To address the low accuracy in single-modal depression detection and the poor fusion of multimodal features from electroencephalogram (EEG) and speech signals, we have proposed a multimodal depression detection model based on EEG and speech signals, named the multi-head attention-GCN_ViT (MHA-GCN_ViT). This approach leverages deep learning techniques, including graph convolutional networks (GCN) and vision transformers (ViT), to effectively extract and fuse the frequency-domain features and spatiotemporal characteristics of EEG signals with the frequency-domain features of speech signals. First, a discrete wavelet transform (DWT) was used to extract wavelet features from 29 channels of EEG signals. These features serve as node attributes for the construction of a feature matrix, calculating the Pearson correlation coefficient between channels, from which an adjacency matrix is constructed to represent the brain network structure. This structure was then fed into a graph convolutional network (GCN) for deep feature learning. A multi-head attention mechanism was introduced to enhance the GCN's capability in representing brain networks. Using a short-time Fourier transform (STFT), we extracted 2D spectral features of EEG signals and mel spectrogram features of speech signals. Both were further processed using a vision transformer (ViT) to obtain deep features. Finally, the multiple features from EEG and speech spectrograms were fused at the decision level for depression classification. A five-fold cross-validation on the MODMA dataset demonstrated the model's accuracy, precision, recall, and F1 score of 89.03%, 90.16%, 89.04%, and 88.83%, respectively, indicating a significant improvement in the performance of multimodal depression detection. Furthermore, MHA-GCN_ViT demonstrated robust performance in depression detection and exhibited broad applicability, with potential for extension to multimodal detection tasks in other psychological and neurological disorders.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

求助全文

约1分钟内获得全文去求助

来源期刊

Mathematical Biosciences and Engineering 工程技术-数学跨学科应用

CiteScore

3.90

自引率

7.70%

发文量

586

审稿时长

>12 weeks

期刊介绍： Mathematical Biosciences and Engineering (MBE) is an interdisciplinary Open Access journal promoting cutting-edge research, technology transfer and knowledge translation about complex data and information processing. MBE publishes Research articles (long and original research); Communications (short and novel research); Expository papers; Technology Transfer and Knowledge Translation reports (description of new technologies and products); Announcements and Industrial Progress and News (announcements and even advertisement, including major conferences).