Time-frequency spectral error for analysis of high arousal speech
P. Gangamohan, S. Gangashetty, B. Yegnanarayana
DOI: 10.21437/SMM.2018-4
High arousal speech is produced when speakers raise their loudness levels. In the high arousal mode, there are deviations from neutral speech, especially in the excitation component of the speech production mechanism. In this study, a parameter called the time-frequency spectral error (TFe) is derived using the single frequency filtering (SFF) spectrogram and is used to characterize high arousal regions in speech signals. The proposed parameter captures the fine temporal and spectral variations due to changes in the excitation source.
A component-based approach to study the effect of Indian music on emotions
V. Viraraghavan, A. Pal, H. Murthy, R. Aravind
DOI: 10.21437/SMM.2018-7
The emotional impact of Indian music on human listeners has been studied mainly with respect to ragas. Although this approach aligns with traditional and musicological views, some studies show that raga-specific effects may not be consistent. In this paper, we propose an alternative method of study based on the components of Indian classical music, which may be viewed as consisting of constant-pitch notes (CPNs), providing the context, and transients, providing the detail. One hundred concert pieces in four ragas each in Carnatic music (CM) and Hindustani music (HM) are analyzed to show that the transients are, on average, longer than the CPNs. Further, the defined scale of the raga is not always mirrored in the CPNs for CM. We also draw upon the result that CPNs and transients scale non-uniformly when the tempo of CM pieces is changed. Based on these observations and previous results on the emotional impact of the major and minor scales in Western music, we propose that the effects of CPNs and transients should be analyzed separately. We present a preliminary experiment that brings out the related challenges.
Analysis of Speech Emotions in Realistic Environments
B. Sarma, Rohan Kumar Das, Abhishek Dey, Risto Haukioja
DOI: 10.21437/smm.2018-3
The classification of emotional speech is a challenging task that depends critically on the correctness of the labeled data. Most of the databases used for research purposes are either acted or simulated. Annotating such acted databases is easier because the actors exaggerate the emotions. On the other hand, emotion labeling of real-world data is very difficult due to confusion among the emotion classes. Another problem in such scenarios is class imbalance, because most of the data collected in realistic environments turns out to be neutral. In this study, we perform emotion labeling of realistic data in a customized manner using emotion priority and confidence level. The annotated speech corpus is then used for analysis and study. The percentage distribution of the different emotion classes in the real-world data and the confusions between emotions during labeling are presented.
Emotional Speech Classifier Systems: For Sensitive Assistance to support Disabled Individuals
V. V. Raju, P. Jain, K. Gurugubelli, A. Vuppala
DOI: 10.21437/SMM.2018-2
This paper addresses the classification of emotionally annotated speech of mentally impaired people. The main problem encountered in the classification task is class imbalance, which arises because far more speech samples are available for neutral speech than for the other emotions. Different sampling methodologies are explored at the back-end to handle this class-imbalance problem. Mel-frequency cepstral coefficient (MFCC) features are used at the front-end, and deep neural networks (DNNs) and gradient boosted decision trees (GBDTs) are investigated as back-end classifiers. Experimental results on the EmotAsS dataset show higher classification accuracy and unweighted average recall (UAR) than the baseline system.
Task-Independent EEG based Subject Identification using Auditory Stimulus
D. Vinothkumar, Mari Ganesh Kumar, Abhishek Kumar, H. Gupta, S. SaranyaM, M. Sur, H. Murthy
DOI: 10.21437/SMM.2018-6
Recent studies have shown that task-specific electroencephalography (EEG) can be used as a reliable biometric. This paper extends this study to task-independent EEG with auditory stimuli. Data collected from 40 subjects in response to various types of audio stimuli, using a 128-channel EEG system, are presented to different classifiers, namely k-nearest neighbor (k-NN), artificial neural network (ANN), and universal background model - Gaussian mixture model (UBM-GMM). It is observed that k-NN and ANN perform well when testing is performed intra-session, while the UBM-GMM framework is more robust when testing is performed inter-session. This can be attributed to the fact that the correspondence of the sensor locations across sessions is only approximate. It is also observed that EEG from the parietal and temporal regions contains more subject information, although the performance using all 128 channels is marginally better.
Discriminating between High-Arousal and Low-Arousal Emotional States of Mind using Acoustic Analysis
Esther Ramdinmawii, V. K. Mittal
DOI: 10.21437/SMM.2018-1
Identification of emotions from human speech can be attempted by focusing on three aspects of emotional speech: valence, arousal, and dominance. In this paper, changes in the production characteristics of emotional speech are examined to discriminate between high-arousal and low-arousal emotions, and among the emotions within each of these categories. The basic emotions anger, happiness, and fear are examined as high-arousal speech, and neutral speech and sadness as low-arousal speech. Discriminating changes are examined first in the excitation source characteristics, i.e., the instantaneous fundamental frequency (F0) derived using the zero-frequency filtering (ZFF) method. Differences observed in the spectrograms are then validated by examining changes in the combined characteristics of the source and the vocal tract filter, i.e., the strength of excitation (SoE), also derived using the ZFF method, and signal energy features. Emotions within each category are distinguished by examining changes in two scarcely explored discriminating features, namely the zero-crossing rate and the ratios among the spectral sub-band energies computed using the short-time Fourier transform. The effectiveness of these features in discriminating emotions is validated using two emotion databases, Berlin EMO-DB (German) and IIT-KGP-SESC (Telugu). The proposed features show highly encouraging results in discriminating these emotions. This study can be helpful towards the automatic classification of emotions from speech.
CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation
Caroline Etienne, Guillaume Fidanza, Andrei Petrovskii, L. Devillers, B. Schmauch
DOI: 10.21437/SMM.2018-5
In this work we design a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Following the latest advances in audio analysis, we use an architecture involving both convolutional layers, for extracting high-level features from raw spectrograms, and recurrent ones, for aggregating long-term dependencies. We examine the techniques of data augmentation with vocal tract length perturbation, layer-wise optimizer adjustment, and batch normalization of recurrent layers, and obtain highly competitive results of 64.5% weighted accuracy and 61.7% unweighted accuracy on four emotions.