Anna Derington;Hagen Wierstorf;Ali Özkil;Florian Eyben;Felix Burkhardt;Björn W. Schuller
{"title":"Testing Correctness, Fairness, and Robustness of Speech Emotion Recognition Models","authors":"Anna Derington;Hagen Wierstorf;Ali Özkil;Florian Eyben;Felix Burkhardt;Björn W. Schuller","doi":"10.1109/TAFFC.2025.3547218","DOIUrl":null,"url":null,"abstract":"Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated based on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. Those models are mainly evaluated in terms of correlation or recall, and always show some errors in their predictions. The errors manifest themselves in model behaviour, which can be very different along different dimensions even if the same recall or correlation is achieved by the model. This paper introduces a testing framework to investigate behaviour of speech emotion recognition models, by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. It also provides a method for automatically specifying test thresholds for fairness tests, based on the datasets used, and recommendations on how to select the remaining test thresholds. We evaluated a xLSTM-based and nine transformer-based acoustic foundation models against a convolutional baseline model, testing their performance on arousal, valence, dominance, and emotional category classification. The test results highlight, that models with high correlation or recall might rely on shortcuts – such as text sentiment –, and differ in terms of fairness.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1929-1941"},"PeriodicalIF":9.8000,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10908859/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated based on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. Those models are mainly evaluated in terms of correlation or recall, and always show some errors in their predictions. The errors manifest themselves in model behaviour, which can be very different along different dimensions even if the same recall or correlation is achieved by the model. This paper introduces a testing framework to investigate behaviour of speech emotion recognition models, by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. It also provides a method for automatically specifying test thresholds for fairness tests, based on the datasets used, and recommendations on how to select the remaining test thresholds. We evaluated a xLSTM-based and nine transformer-based acoustic foundation models against a convolutional baseline model, testing their performance on arousal, valence, dominance, and emotional category classification. The test results highlight, that models with high correlation or recall might rely on shortcuts – such as text sentiment –, and differ in terms of fairness.
期刊介绍:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.