Multimodal learning can combine different channels of information simultaneously to improve modeling capability. Many recent studies focus on overcoming the challenges of inter-modal conflicts and incomplete intra-modal learning in multimodal architectures. In this paper, we propose a scalable multimodal speech emotion recognition (SER) framework built on a hierarchical bottleneck feature (HBF) fusion approach. We further design an intra-modal and inter-modal contrastive learning mechanism that enables self-supervised calibration of both modality-specific and cross-modal feature distributions, achieving adaptive feature fusion and alignment while significantly reducing reliance on rigid feature-alignment constraints. In addition, by restricting the learning paths of the modality encoders, we design a modality representation constraint (MRC) method that mitigates conflicts between modalities. We also present a modality bargaining (MB) strategy that promotes intra-modal learning through mutual bargaining and balancing: the modalities alternately take the lead during optimization, which prevents convergence to suboptimal modality representations. Together, these training strategies enable our architecture to perform well on multimodal emotion datasets such as CREMA-D, IEMOCAP, and MELD. Finally, we conduct extensive experiments demonstrating the effectiveness of the proposed architecture across various modality encoders and modality combination methods.
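
As a concrete illustration of the ideas summarized above, the minimal PyTorch sketch below shows one plausible realization of bottleneck-token fusion and an inter-modal contrastive objective. It is not the paper's released code: the module names (HBFFusion, inter_modal_contrastive), the layer counts, the bottleneck size, and the temperature tau are all our own assumptions for illustration.

```python
# Illustrative sketch only; names, shapes, and hyperparameters are assumed,
# not taken from the paper. Each modality exchanges information only through
# a small set of shared bottleneck tokens, and paired utterance-level
# embeddings are pulled together with an InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HBFFusion(nn.Module):
    """Hierarchical bottleneck fusion over two modality token streams."""

    def __init__(self, dim=256, n_layers=4, n_bottleneck=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.audio_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(n_layers)])
        self.text_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(n_layers)])

    def forward(self, audio_tokens, text_tokens):
        b = audio_tokens.size(0)
        btl = self.bottleneck.expand(b, -1, -1)
        n_btl = btl.size(1)
        for a_layer, t_layer in zip(self.audio_layers, self.text_layers):
            # Each modality attends over its own tokens plus the shared
            # bottleneck tokens; the two bottleneck copies are averaged so
            # cross-modal information flows into the next layer.
            a_out = a_layer(torch.cat([audio_tokens, btl], dim=1))
            t_out = t_layer(torch.cat([text_tokens, btl], dim=1))
            audio_tokens, btl_a = a_out[:, :-n_btl], a_out[:, -n_btl:]
            text_tokens, btl_t = t_out[:, :-n_btl], t_out[:, -n_btl:]
            btl = 0.5 * (btl_a + btl_t)
        # Pooled per-modality embeddings plus the fused bottleneck summary.
        return audio_tokens.mean(1), text_tokens.mean(1), btl.mean(1)


def inter_modal_contrastive(z_a, z_t, tau=0.07):
    """Symmetric InfoNCE loss aligning paired audio/text embeddings."""
    z_a = F.normalize(z_a, dim=-1)
    z_t = F.normalize(z_t, dim=-1)
    logits = z_a @ z_t.t() / tau
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

In this reading, the bottleneck tokens act as the only channel of cross-modal exchange, so each encoder's learning path stays mostly within its own modality while the contrastive term softly aligns the resulting feature distributions rather than enforcing a hard alignment constraint.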
