An Efficient Bi-Modal Fusion Framework for Music Emotion Recognition
Yao Xiao; Haoxin Ruan; Xujian Zhao; Peiquan Jin; Li Tian; Zihan Wei; Xuebo Cai; Yixin Wang; Liang Liu
IEEE Transactions on Affective Computing, vol. 16, no. 2, pp. 999-1015
DOI: 10.1109/TAFFC.2024.3486340
Published: 2024-10-24
https://ieeexplore.ieee.org/document/10735097/
Citations: 0
Abstract
Current methods for Music Emotion Recognition (MER) struggle to extract emotion-sensitive features, especially those rich in temporal detail. Moreover, the narrow scope of music-related modalities impedes data integration from multiple sources, while including multiple modalities often introduces redundant information that can degrade performance. To address these issues, we propose a lightweight framework for music emotion recognition that improves the extraction of emotion-sensitive, temporally rich features and integrates data from both the audio and MIDI modalities while minimizing redundancy. Our approach develops two novel unimodal encoders to learn embeddings from audio and MIDI-like features. Additionally, we introduce a Bi-modal Fusion Attention Model (BFAM) that integrates features across modalities, from low-level details to high-level semantic information. Experimental evaluations on the EMOPIA and VGMIDI datasets show that, on EMOPIA, our unimodal networks achieve accuracies 6.1% and 4.4% higher than the baseline algorithms for MIDI and audio, respectively. Furthermore, BFAM improves accuracy over the baseline by 15.2%, reaching 82.2%, underscoring its effectiveness for bi-modal MER applications.
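The abstract describes BFAM only at a high level, so the snippet below is a purely illustrative sketch, not the authors' implementation: it shows one common way to fuse audio and MIDI embeddings with bidirectional cross-attention and pool the result for quadrant-style emotion classification (EMOPIA uses four valence/arousal quadrant labels). All names (BiModalFusion, d_model, n_heads) and architectural details are assumptions.

# Illustrative sketch only: a generic cross-modal attention fusion block,
# not the paper's actual BFAM. Module and parameter names are hypothetical.
import torch
import torch.nn as nn


class BiModalFusion(nn.Module):
    """Fuse audio and MIDI embeddings with bidirectional cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Audio queries attend over MIDI keys/values, and vice versa.
        self.audio_to_midi = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.midi_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_m = nn.LayerNorm(d_model)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 4),  # assumed four emotion-quadrant classes
        )

    def forward(self, audio_emb: torch.Tensor, midi_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb: (batch, T_audio, d_model), midi_emb: (batch, T_midi, d_model)
        a_fused, _ = self.audio_to_midi(audio_emb, midi_emb, midi_emb)
        m_fused, _ = self.midi_to_audio(midi_emb, audio_emb, audio_emb)
        a = self.norm_a(audio_emb + a_fused).mean(dim=1)  # residual + temporal pooling
        m = self.norm_m(midi_emb + m_fused).mean(dim=1)
        return self.classifier(torch.cat([a, m], dim=-1))  # emotion logits


if __name__ == "__main__":
    fusion = BiModalFusion()
    audio = torch.randn(2, 100, 256)  # dummy output of an audio encoder
    midi = torch.randn(2, 80, 256)    # dummy output of a MIDI encoder
    print(fusion(audio, midi).shape)  # torch.Size([2, 4])

The actual BFAM reportedly integrates features from low-level to high-level semantic information across modalities; this sketch captures only the general cross-attention fusion idea under the stated assumptions.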
About the Journal
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.