Title: Phonetically-Anchored Domain Adaptation for Cross-Lingual Speech Emotion Recognition
Authors: Shreya G. Upadhyay; Luz Martinez-Lucas; William Katz; Carlos Busso; Chi-Chun Lee
Journal: IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1631-1645
DOI: 10.1109/TAFFC.2025.3530105
Publication date: 2025-01-15
URL: https://ieeexplore.ieee.org/document/10842508/
Abstract
The prevalence of cross-lingual speech emotion recognition (SER) modeling has significantly increased due to its wide range of applications. Previous studies have primarily focused on technical strategies to adapt features, domains, and labels across languages, often overlooking the underlying commonalities between the languages. In this study, we address the language adaptation challenge in cross-lingual scenarios by incorporating vowel-phonetic constraints. Our approach is structured in two main parts. First, we investigate the vowel-phonetic commonalities associated with specific emotions across languages, particularly focusing on common vowels that prove to be valuable for SER modeling. Second, we utilize these identified common vowels as anchors to facilitate cross-lingual SER. To demonstrate the effectiveness of our approach, we conduct case studies using American English and Taiwanese Mandarin with two naturalistic emotional speech corpora: the MSP-Podcast and BIIC-Podcast corpora. The approach leverages evidence that certain vowels, including monophthongs and diphthongs, exhibit emotion-specific commonality across languages, serving as phonetic anchors to enhance unsupervised cross-lingual SER learning. The proposed model surpasses baseline performance, highlighting the importance of phonetic similarities for effective language adaptation in cross-lingual SER scenarios.
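The central idea described in the abstract, using vowels shared across languages as anchors for unsupervised language adaptation, can be illustrated with a small sketch. The code below is a hypothetical illustration rather than the authors' implementation: the model architecture, feature dimensions, and the linear-kernel MMD alignment loss between embeddings of common-vowel segments from the labeled source language and the unlabeled target language are all assumptions made for clarity.

```python
# Hypothetical sketch of phonetically-anchored cross-lingual SER adaptation.
# All names, dimensions, and the MMD alignment objective are illustrative
# assumptions, not the method as published.
import torch
import torch.nn as nn

class AnchoredSERModel(nn.Module):
    """Emotion classifier whose encoder representations of common-vowel
    segments are aligned across source and target languages."""
    def __init__(self, feat_dim=80, hidden=256, n_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def embed(self, x):
        # x: (batch, time, feat_dim) acoustic features for an utterance
        # or a vowel-anchored segment.
        _, h = self.encoder(x)                  # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * hidden)

    def forward(self, x):
        return self.classifier(self.embed(x))

def mmd_loss(a, b):
    """Simple linear-kernel MMD between two sets of embeddings."""
    return (a.mean(dim=0) - b.mean(dim=0)).pow(2).sum()

def training_step(model, src_x, src_y, src_vowel_x, tgt_vowel_x, lam=0.1):
    """One step: supervised emotion loss on the source language plus an
    unsupervised alignment loss on segments of vowels shared by both
    languages (the phonetic anchors)."""
    ce = nn.functional.cross_entropy(model(src_x), src_y)
    align = mmd_loss(model.embed(src_vowel_x), model.embed(tgt_vowel_x))
    return ce + lam * align
```

In a setup following the paper's case study, the anchor segments would be drawn from the emotion-specific common monophthongs and diphthongs identified in the first part of the study, extracted from MSP-Podcast (American English) as the source and BIIC-Podcast (Taiwanese Mandarin) as the target.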
Journal Introduction
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. The journal also welcomes surveys of existing work that provide new perspectives on the historical and future directions of this field.