Automatic Assessment of Depression from Speech and Behavioural Signals
J. Epps. DOI: https://doi.org/10.1145/2661806.2661820

Research into automatic recognition and prediction of depression from behavioural signals like speech and facial video represents an exciting mix of opportunity and challenge. The opportunity comes from the huge prevalence of depression worldwide and the fact that clinicians already explicitly or implicitly account for observable behaviour in their assessments. The challenge comes from the multi-factorial nature of depression, and the complexity of behavioural signals, which convey several other important types of information in addition to depression. Investigations in our group to date have revealed some interesting perspectives on how to deal with confounding effects (e.g. due to speaker identity) and the role of depression-related signal variability. This presentation will focus on how depression is manifested in the speech signal, how to model depression in speech, methods for mitigating unwanted variability in speech, how depression assessment differs from more mainstream affective computing, what is needed from depression databases, and different possible system designs and applications. A range of fertile areas for future research will be suggested.
Emotion Recognition and Depression Diagnosis by Acoustic and Visual Features: A Multimodal Approach
M. Sidorov, W. Minker. DOI: https://doi.org/10.1145/2661806.2661816

There are many potential applications for a system capable of recognizing human emotions, e.g., improving Spoken Dialogue Systems (SDSs) or monitoring agents in call centres. Depression is another aspect of the human condition that is closely related to emotion. A system that can automatically diagnose a patient's depression could help physicians support their decisions and avoid critical mistakes. The Affect and Depression Recognition Sub-Challenges (ASC and DSC, respectively) of the second combined open Audio/Visual Emotion and Depression recognition Challenge (AVEC 2014) therefore focus on estimating emotion and depression. This study presents the results of multimodal affect and depression recognition based on four different segmentation methods, using support vector regression. Furthermore, a speaker identification procedure is introduced in order to build speaker-specific emotion/depression recognition systems.
Ensemble CCA for Continuous Emotion Prediction
Heysem Kaya, Fazilet Çilli, A. A. Salah. DOI: https://doi.org/10.1145/2661806.2661814

This paper presents our work on the ACM MM Audio Visual Emotion Corpus 2014 (AVEC 2014) using the baseline features in accordance with the challenge protocol. For prediction, we use Canonical Correlation Analysis (CCA) in the affect sub-challenge (ASC) and the Moore-Penrose generalized inverse (MPGI) in the depression sub-challenge (DSC). The video baseline provides histograms of Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) features. Based on our preliminary experiments on the AVEC 2013 challenge data, we focus on the inner facial regions that correspond to the eye and mouth areas. We obtain an ensemble of regional linear regressors via CCA and MPGI. We also enrich the 2014 baseline set with Local Phase Quantization (LPQ) features extracted from faces detected and tracked with the IntraFace toolkit. Combining both representations in a CCA ensemble approach, we reach an average Pearson's Correlation Coefficient (PCC) of 0.3932 on the challenge test set, outperforming the ASC test set baseline PCC of 0.1966. On the DSC, combining modality-specific MPGI-based ensemble systems, we reach a Root Mean Square Error (RMSE) of 9.61.
Depression Estimation Using Audiovisual Features and Fisher Vector Encoding
V. Jain, J. Crowley, A. Dey, A. Lux. DOI: https://doi.org/10.1145/2661806.2661817

We investigate the use of two visual descriptors, Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) and Dense Trajectories, for depression assessment on the AVEC 2014 challenge dataset. We encode the visual information generated by the two descriptors using Fisher Vector encoding, which has been shown to be one of the best-performing methods for encoding visual data for image classification. We also incorporate audio features into the final system to introduce multiple input modalities. The results produced using linear Support Vector Regression outperform the baseline method.