{"title":"A Novel Hybrid Attention-Based Dilated Network for Depression Classification Model from Multimodal Data Using Improved Heuristic Approach","authors":"B. Manjulatha, Suresh Pabboju","doi":"10.1142/s0219467826500105","DOIUrl":null,"url":null,"abstract":"Automatic depression classification from multimodal input data is a challenging task. Modern methods use paralinguistic information such as audio and video signals. Using linguistic information such as speech signals and text data for depression classification is a complicated task in deep learning models. Best audio and video features are built to produce a dependable depression classification system. Textual signals related to depression classification are analyzed using text-based content data. Moreover, to increase the achievements of the depression classification system, audio, visual, and text descriptors are used. So, a deep learning-based depression classification model is developed to detect the person with depression from multimodal data. The EEG signal, Speech signal, video, and text are gathered from standard databases. Four stages of feature extraction take place. In the first stage, the features from the decomposed EEG signals are attained by the empirical mode decomposition (EMD) method, and features are extracted by means of linear and nonlinear feature extraction. In the second stage, the spectral features of the speech signals from the Mel-frequency cepstral coefficients (MFCC) are extracted. In the third stage, the facial texture features from the input video are extracted. In the fourth stage of feature extraction, the input text data are pre-processed, and from the pre-processed data, the textual features are extracted by using the Transformer Net. All four sets of features are optimally selected and combined with the optimal weights to get the weighted fused features using the enhanced mountaineering team-based optimization algorithm (EMTOA). The optimal weighted fused features are finally given to the hybrid attention-based dilated network (HADN). The HDAN is developed by combining temporal convolutional network (TCN) with bidirectional long short-term memory (Bi-LSTM). The parameters in the HDAN are optimized with the assistance of the developed EMTOA algorithm. At last, the classified output of depression is obtained from the HDAN. The efficiency of the developed deep learning HDAN is validated by comparing it with various traditional classification models.","PeriodicalId":44688,"journal":{"name":"International Journal of Image and Graphics","volume":null,"pages":null},"PeriodicalIF":0.8000,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Image and Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/s0219467826500105","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
Automatic depression classification from multimodal input data is a challenging task. Modern methods rely on paralinguistic information such as audio and video signals, while exploiting linguistic information such as speech signals and text data remains difficult for deep learning models. A dependable depression classification system therefore requires strong audio and video features, together with textual cues analyzed from text-based content. To further improve performance, audio, visual, and text descriptors are combined. Accordingly, a deep learning-based depression classification model is developed to detect persons with depression from multimodal data. EEG signals, speech signals, video, and text are gathered from standard databases, and feature extraction proceeds in four stages. In the first stage, the EEG signals are decomposed by the empirical mode decomposition (EMD) method, and linear and nonlinear features are extracted from the decomposed components. In the second stage, spectral features of the speech signals are extracted as Mel-frequency cepstral coefficients (MFCC). In the third stage, facial texture features are extracted from the input video. In the fourth stage, the input text data are pre-processed, and textual features are extracted from the pre-processed data using a Transformer network. The four feature sets are optimally selected and combined with optimal weights into weighted fused features using the enhanced mountaineering team-based optimization algorithm (EMTOA). The weighted fused features are finally passed to the hybrid attention-based dilated network (HADN), which combines a temporal convolutional network (TCN) with bidirectional long short-term memory (Bi-LSTM). The parameters of the HADN are optimized with the assistance of the developed EMTOA, and the classified depression output is obtained from the HADN. The efficiency of the developed deep learning HADN is validated by comparing it with various traditional classification models. Illustrative sketches of the main pipeline stages follow.
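Stage 1 (EEG): the abstract specifies EMD decomposition followed by linear and nonlinear feature extraction but does not name an implementation or the exact feature set. A minimal sketch, assuming the PyEMD package and using per-IMF mean and standard deviation as the linear features and Shannon entropy as a stand-in nonlinear feature:

```python
# Sketch of stage 1: EMD decomposition of one EEG channel, then simple
# linear (mean, std) and nonlinear (Shannon entropy) features per IMF.
# PyEMD ("pip install EMD-signal") is an assumed library choice.
import numpy as np
from PyEMD import EMD

def shannon_entropy(x, bins=32):
    """Nonlinear feature: entropy of the amplitude distribution."""
    hist, _ = np.histogram(x, bins=bins)
    p = hist / (hist.sum() + 1e-12)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def eeg_emd_features(signal, max_imfs=5):
    imfs = EMD()(signal)[:max_imfs]  # decompose into intrinsic mode functions
    feats = []
    for imf in imfs:
        feats += [imf.mean(), imf.std(), shannon_entropy(imf)]
    return np.asarray(feats)

# Toy usage: one second of synthetic 10 Hz "EEG" sampled at 256 Hz.
sig = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 256)) + 0.1 * np.random.randn(256)
print(eeg_emd_features(sig).shape)
```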
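Stage 2 (speech): the paper specifies MFCC spectral features. A minimal sketch, assuming librosa as the toolkit and 13 coefficients pooled over frames (the coefficient count and pooling are illustrative assumptions):

```python
# Sketch of stage 2: MFCC spectral features of a speech recording,
# mean/std-pooled over time into a fixed-length descriptor.
import numpy as np
import librosa

def speech_mfcc_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)                     # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    # Pool frame-level coefficients into one vector per utterance.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```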
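Stage 3 (video): the abstract says facial texture features are extracted but does not name the descriptor. Local binary patterns (LBP) are a common texture choice and stand in here purely as an assumption, using scikit-image:

```python
# Sketch of stage 3: LBP texture histograms from grayscale face frames,
# averaged over the clip. LBP is an assumed descriptor, not the paper's.
import numpy as np
from skimage.feature import local_binary_pattern

def face_texture_features(gray_frame, P=8, R=1):
    lbp = local_binary_pattern(gray_frame, P, R, method="uniform")
    # Histogram of the P+2 uniform LBP codes as the frame descriptor.
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def video_texture_features(gray_frames):
    # Average the per-frame descriptors over the whole clip.
    return np.mean([face_texture_features(f) for f in gray_frames], axis=0)
```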
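Stage 4 (text): the abstract only says the pre-processed text is encoded with a "Transformer Net". A minimal sketch with PyTorch's built-in Transformer encoder; vocabulary size, depth, and mean pooling are illustrative assumptions:

```python
# Sketch of stage 4: textual features from pre-processed token ids via a
# small Transformer encoder with mean pooling over the sequence.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, d_model=128, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, token_ids):                  # (batch, seq_len)
        h = self.encoder(self.embed(token_ids))    # (batch, seq_len, d_model)
        return h.mean(dim=1)                       # pooled text feature

feats = TextEncoder()(torch.randint(0, 10000, (2, 20)))  # -> (2, 128)
```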
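Fusion: the four feature sets are weighted and concatenated, with the weights (and feature subsets) chosen by the EMTOA optimizer. EMTOA itself is the paper's contribution and is not reproduced here; the sketch below shows only the fusion step, with placeholder weights where the optimizer's output would go:

```python
# Sketch of the weighted fusion step: normalize each modality's feature
# vector, scale by its weight, and concatenate. Weights are placeholders
# standing in for EMTOA's output.
import numpy as np

def weighted_fuse(feature_sets, weights):
    assert len(feature_sets) == len(weights)
    fused = [w * (f / (np.linalg.norm(f) + 1e-12))
             for f, w in zip(feature_sets, weights)]
    return np.concatenate(fused)

# Toy usage with arbitrary feature sizes for EEG, MFCC, face, and text.
eeg, mfcc, face, text = (np.random.randn(n) for n in (15, 26, 10, 128))
fused = weighted_fuse([eeg, mfcc, face, text], weights=[0.3, 0.25, 0.2, 0.25])
print(fused.shape)  # (179,)
```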
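Classifier: the HADN combines a dilated TCN with a Bi-LSTM and attention. A minimal PyTorch sketch under assumed layer sizes and dilation schedule (in the paper these hyperparameters are tuned by EMTOA); attention is implemented as simple additive pooling over time:

```python
# Sketch of the HADN classifier: dilated 1-D convolutions (TCN branch)
# feeding a bidirectional LSTM, with attention pooling before the head.
import torch
import torch.nn as nn

class HADN(nn.Module):
    def __init__(self, in_dim, hidden=64, num_classes=2):
        super().__init__()
        # TCN branch: stacked 1-D convolutions with growing dilation.
        self.tcn = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)       # per-step attention score
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                          # x: (batch, time, in_dim)
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, hidden)
        h, _ = self.bilstm(h)                      # (batch, time, 2*hidden)
        a = torch.softmax(self.attn(h), dim=1)     # attention over time steps
        ctx = (a * h).sum(dim=1)                   # weighted temporal pooling
        return self.head(ctx)

# Toy usage: batch of 4 sequences of 30 fused feature vectors (dim 179).
logits = HADN(in_dim=179)(torch.randn(4, 30, 179))  # -> (4, 2)
```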