Multimodal Emotion Recognition in Conversations (MERC) aims to recognize utterance-level emotions robustly when heterogeneous textual, acoustic, and visual signals intertwine and the dialogue context evolves across turns. Although graph-based dialogue methods have made progress, decision calibration, geometric alignment, and class-level organization are often modeled in isolation when modality conflicts coexist with cross-turn context shifts. This isolation causes information diffusion and structural redundancy, which in turn hampers the separability of weakly distinguishable emotions and overall robustness. To address these issues, we introduce Graph-Prototype Distillation with Prototype-Guided Contrastive Training (GPGC), which jointly constrains representation alignment, distributional consistency, and prototype alignment on a unified intra-modal graph-aggregated representation, thereby reducing intra-class dispersion from both probabilistic and geometric perspectives and stabilizing class-prototype directions. Prototype-guided momentum contrast further leverages a stable cross-batch dictionary and guided positives to consistently enlarge margins against hard negatives while suppressing the interference of noisy samples during optimization. Systematic evaluations on two widely used MERC benchmarks and an in-the-wild multimodal sentiment benchmark demonstrate consistent improvements in both overall performance and stability.
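To make the prototype-guided momentum-contrast component concrete, the PyTorch sketch below illustrates one plausible form of the loss: guided positives (the augmented key and the same-class prototype) are contrasted against a cross-batch momentum dictionary from which same-class keys are masked out. All names (proto_contrast_loss, momentum_update), the queue layout, and hyperparameters (tau, m) are illustrative assumptions, not the paper's actual implementation, and the distillation terms on the graph-aggregated representation are not shown.

```python
# Minimal sketch of prototype-guided momentum contrast, assuming a MoCo-style
# key encoder and queue. Names and hyperparameters are hypothetical.
import torch
import torch.nn.functional as F

def momentum_update(encoder_q, encoder_k, m=0.999):
    # Slowly update the key encoder so the cross-batch dictionary stays stable.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

def proto_contrast_loss(q, k, queue, queue_labels, labels, prototypes, tau=0.07):
    """q, k: (B, D) query/key embeddings; queue: (K, D) cross-batch keys;
    queue_labels: (K,); labels: (B,); prototypes: (C, D) class prototypes."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    protos = F.normalize(prototypes, dim=1)
    # Guided positives: the augmented key and the same-class prototype.
    pos_key = (q * k).sum(dim=1, keepdim=True) / tau                 # (B, 1)
    pos_proto = (q * protos[labels]).sum(dim=1, keepdim=True) / tau  # (B, 1)
    # Negatives come from the momentum queue; mask out same-class keys so
    # margins are driven by hard negatives rather than noisy positives.
    logits_neg = q @ queue.t() / tau                                 # (B, K)
    same_class = labels[:, None].eq(queue_labels[None, :])
    logits_neg = logits_neg.masked_fill(same_class, float('-inf'))
    logits = torch.cat([pos_key, pos_proto, logits_neg], dim=1)
    log_prob = F.log_softmax(logits, dim=1)
    # Average the InfoNCE terms of the two positives (columns 0 and 1).
    return -(log_prob[:, 0] + log_prob[:, 1]).mean() / 2
```

In this reading, the prototype positive pulls each query toward a stable class direction while the masked queue supplies hard negatives across batches, which matches the abstract's claim of enlarging margins while reducing noisy-sample interference.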