Learning an effective joint representation is fundamental to Multimodal Sentiment Analysis (MSA). Existing studies typically adopt complex networks to construct joint multimodal representations directly, yet they often overlook the heterogeneity among modalities and the preservation of modality-specific information. Moreover, current methods tend to treat all modalities equally, failing to exploit the rich emotional cues in the text modality. To address these issues, we propose a Text-based Parallel Interaction Network (TPIN) that balances the commonality and specificity of different modalities. TPIN consists of two components: Modality-Common Information Processing (MCIP) and Modality-Specific Information Processing (MSIP). In MCIP, we propose a novel contrastive learning algorithm with Hard Negative Mining (HNM), which is integrated into our Two-Stage Contrastive Learning (TSCL) scheme to mitigate inter-modal heterogeneity. We further design a Text-Guided Dynamic Semantic Aggregation (TG-DSA) module to enable deep multimodal fusion under the guidance of the text modality. In MSIP, we devise a dynamic routing mechanism that iteratively optimizes routing weights to better capture modality-specific information in the visual and acoustic modalities. Experimental results demonstrate that our method achieves state-of-the-art performance on both the CMU-MOSI and CMU-MOSEI datasets, with consistent gains of 0.5%-1.2% across major evaluation metrics over recent advanced models.
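To make the contrastive component more concrete, the following is a minimal, hypothetical PyTorch sketch of an InfoNCE-style text-visual contrastive loss in which each anchor's negative set is restricted to its most similar in-batch negatives, illustrating the general idea of contrastive learning with hard negative mining. The function name and the `temperature` and `num_hard` parameters are illustrative assumptions, not the authors' implementation or the exact TSCL formulation.

```python
# Illustrative sketch only (not the authors' released code): InfoNCE-style
# contrastive loss between paired text and visual embeddings, with the
# negative set per anchor limited to its hardest in-batch negatives.
import torch
import torch.nn.functional as F


def contrastive_loss_with_hnm(text_emb, visual_emb, temperature=0.07, num_hard=8):
    """text_emb, visual_emb: (batch, dim) paired embeddings; returns a scalar loss."""
    text_emb = F.normalize(text_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)

    # Pairwise cosine similarities between all text/visual pairs in the batch.
    sim = text_emb @ visual_emb.t() / temperature            # (batch, batch)
    batch_size = sim.size(0)
    pos = sim.diag()                                          # matched (positive) pairs

    # Mask the positive on the diagonal, then keep only the num_hard most
    # similar negatives per anchor (hard negative mining).
    diag_mask = torch.eye(batch_size, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(diag_mask, float('-inf'))
    hard_neg, _ = neg.topk(min(num_hard, batch_size - 1), dim=-1)

    # InfoNCE over the positive and its selected hard negatives.
    logits = torch.cat([pos.unsqueeze(1), hard_neg], dim=1)
    labels = torch.zeros(batch_size, dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)
```

In this toy setting, restricting the softmax denominator to the hardest negatives focuses the gradient on pairs that are most easily confused across modalities, which is the motivation the abstract gives for combining HNM with the two-stage contrastive scheme.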