{"title":"Cross-corpus analysis for acoustic recognition of negative interactions","authors":"I. Lefter, H. Nefs, C. Jonker, L. Rothkrantz","doi":"10.1109/ACII.2015.7344562","DOIUrl":null,"url":null,"abstract":"Recent years have witnessed a growing interest in recognizing emotions and events based on speech. One of the applications of such systems is automatically detecting when a situations gets out of hand and human intervention is needed. Most studies have focused on increasing recognition accuracies using parts of the same dataset for training and testing. However, this says little about how such a trained system is expected to perform `in the wild'. In this paper we present a cross-corpus study using the audio part of three multimodal datasets containing negative human-human interactions. We present intra- and cross-corpus accuracies whilst manipulating the acoustic features, normalization schemes, and oversampling of the least represented class to alleviate the negative effects of data unbalance. We observe a decrease in performance when disjunct corpora are used for training and testing. Merging two datasets for training results in a slightly lower performance than the best one obtained by using only one corpus for training. A hand crafted low dimensional feature set shows competitive behavior when compared to a brute force high dimensional features vector. Corpus normalization and artificially creating samples of the sparsest class have a positive effect.","PeriodicalId":6863,"journal":{"name":"2015 International Conference on Affective Computing and Intelligent Interaction (ACII)","volume":"48 1","pages":"132-138"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Affective Computing and Intelligent Interaction (ACII)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ACII.2015.7344562","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 15
Abstract
Recent years have witnessed growing interest in recognizing emotions and events from speech. One application of such systems is automatically detecting when a situation gets out of hand and human intervention is needed. Most studies have focused on increasing recognition accuracy by using parts of the same dataset for training and testing. However, this says little about how such a trained system can be expected to perform 'in the wild'. In this paper we present a cross-corpus study using the audio part of three multimodal datasets containing negative human-human interactions. We report intra- and cross-corpus accuracies while varying the acoustic features, normalization schemes, and oversampling of the least represented class to alleviate the negative effects of class imbalance. We observe a decrease in performance when disjoint corpora are used for training and testing. Merging two datasets for training results in slightly lower performance than the best result obtained by training on a single corpus. A hand-crafted low-dimensional feature set is competitive with a brute-force high-dimensional feature vector. Corpus normalization and artificially creating samples of the sparsest class both have a positive effect.
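
The evaluation protocol described in the abstract (train on one corpus, test on a disjoint one, with corpus-level normalization and oversampling of the sparsest class) can be illustrated with a short sketch. The Python code below is a hypothetical illustration using numpy and scikit-learn on randomly generated stand-in data; the feature dimensions, the SVM classifier, and the SMOTE-like interpolation heuristic are assumptions chosen for clarity, not the authors' implementation.

# Minimal sketch of a cross-corpus protocol: train on one corpus, test on
# another, with per-corpus z-normalization and synthetic oversampling of the
# rarest class. All names and data are hypothetical placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def z_normalize_per_corpus(X):
    # Corpus normalization: z-score each feature using that corpus's own stats.
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
    return (X - mu) / sigma

def oversample_minority(X, y, rng):
    # Artificially create samples of the sparsest class by interpolating
    # between random pairs of its members (a SMOTE-like heuristic; this is
    # an assumption, the paper does not specify this exact method).
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    Xm = X[y == minority]
    i = rng.integers(0, len(Xm), size=deficit)
    j = rng.integers(0, len(Xm), size=deficit)
    alpha = rng.random((deficit, 1))
    X_new = Xm[i] + alpha * (Xm[j] - Xm[i])
    return np.vstack([X, X_new]), np.concatenate([y, np.full(deficit, minority)])

rng = np.random.default_rng(0)
# Stand-in acoustic feature matrices for two corpora
# (rows = utterances, columns = acoustic descriptors).
X_train, y_train = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)
X_test,  y_test  = rng.normal(size=(100, 20)), rng.integers(0, 2, 100)

X_train, y_train = oversample_minority(z_normalize_per_corpus(X_train), y_train, rng)
X_test = z_normalize_per_corpus(X_test)  # normalized with its own corpus stats

clf = SVC().fit(X_train, y_train)
print("cross-corpus accuracy:", accuracy_score(y_test, clf.predict(X_test)))

Normalizing the test corpus with its own statistics, rather than those of the training corpus, mirrors the corpus-normalization scheme the abstract reports as beneficial when training and testing corpora are disjoint.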