Pub Date: 2024-08-19, DOI: 10.1109/TAI.2024.3445325
Tao Meng; Yuntao Shou; Wei Ai; Nan Yin; Keqin Li
The main task of multimodal emotion recognition in conversations (MERC) is to identify the emotions expressed across modalities such as text, audio, image, and video, and it is an important direction for realizing machine intelligence. However, much of the data in MERC naturally exhibits an imbalanced distribution of emotion categories, and researchers have overlooked the negative impact of imbalanced data on emotion recognition. To tackle this problem, we systematically analyze it from three aspects (data augmentation, loss sensitivity, and sampling strategy) and propose the class boundary enhanced representation learning (CBERL) model. Concretely, we first design a multimodal generative adversarial network to address the imbalanced distribution of emotion categories in the raw data. Second, we propose a deep joint variational autoencoder to fuse complementary semantic information across modalities and obtain discriminative feature representations. Finally, we implement a multitask graph neural network with mask reconstruction and classification optimization to address overfitting and underfitting in class boundary learning and to achieve cross-modal emotion recognition. We conducted extensive experiments on the interactive emotional dyadic motion capture (IEMOCAP) and multimodal emotion lines dataset (MELD) benchmarks, and the results show that CBERL improves emotion recognition performance. In particular, on the minority-class emotion labels “fear” and “disgust,” our model improves accuracy and F1 score by 10% to 20%. Our code is publicly available at https://github.com/yuntaoshou/CBERL
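To make the "loss sensitivity" aspect concrete, the following is a minimal, generic sketch of inverse-frequency class weighting applied to cross-entropy. It is not the CBERL implementation (see the linked repository for that); the function names, the toy label counts, and the weighting formula N / (K * count) are assumptions chosen purely for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Hypothetical helper: weight each class by N / (K * count).

    N is the number of samples and K the number of distinct classes,
    so rarer classes receive proportionally larger weights.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(y), normalized by the sum of sample weights.

    probs is a list of dicts mapping class name -> predicted probability.
    """
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Toy, made-up label distribution: "neutral" dominates, "fear" is rare.
labels = ["neutral"] * 8 + ["joy"] * 3 + ["fear"] * 1
weights = inverse_frequency_weights(labels)

# A uniform predictor over the three classes, used only to exercise the loss.
uniform = [{"neutral": 1 / 3, "joy": 1 / 3, "fear": 1 / 3} for _ in labels]
loss = weighted_cross_entropy(uniform, labels, weights)
```

With these toy counts, the single "fear" sample receives weight 4.0 and each "neutral" sample 0.5, so an error on the minority class contributes eight times as much to the loss as an error on the majority class.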