Multimodal Summarization (MS) generates high-quality summaries by integrating textual and visual information. However, existing MS research faces several challenges, including (1) neglecting fine-grained key information shared between the visual and textual modalities and its interaction with coarse-grained information, (2) cross-modal semantic inconsistency, which hinders the alignment and fusion of the visual and textual feature spaces, and (3) ignoring the inherent heterogeneity of an image when filtering visual information, which causes over-filtering or over-retention. To address these issues, we propose the Coarse-and-Fine Granularity Synergy and Region Counterfactual Reasoning Filter (CFCR) for MS. Specifically, we design Coarse-and-Fine Granularity Synergy (CFS) to capture both global (coarse-grained) and key detailed (fine-grained) information in the text and image modalities. Building on this, we design Dual-granularity Contrastive Learning (DCL) to map coarse-grained and fine-grained visual features into the textual semantic space, reducing the semantic inconsistency caused by modality differences at both granularity levels and facilitating cross-modal alignment. To address over-filtering and over-retention in visual information filtering, we design a Region Counterfactual Reasoning Filter (RCF) that employs counterfactual reasoning to assess the validity of image regions and generate category labels; these labels are then used to train an Image Region Selector that selects regions beneficial to the summary. Extensive experiments on the representative MMSS and MSMO datasets show that CFCR outperforms multiple strong baselines, particularly in selecting and focusing on critical details, demonstrating its effectiveness for MS.
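The abstract does not specify the DCL objective, so the following is only a minimal sketch of what dual-granularity contrastive alignment could look like: a symmetric InfoNCE loss applied once to global features and once to pooled region/token features. The function names, the mean-pooling of fine-grained features, and the hyperparameters (`alpha`, `temperature`) are assumptions for illustration, not the paper's specification.

```python
# Sketch of dual-granularity contrastive alignment (assumptions throughout;
# not the authors' implementation). Visual/textual features are pre-extracted.
import torch
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings x, y: (B, D)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    # Matched image-text pairs sit on the diagonal; all others act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def dual_granularity_loss(img_global, txt_global, img_regions, txt_tokens,
                          alpha=0.5, temperature=0.07):
    """Coarse-grained (global) plus fine-grained (pooled region/token) terms.
    Shapes: img_global/txt_global (B, D); img_regions (B, R, D);
    txt_tokens (B, T, D)."""
    coarse = info_nce(img_global, txt_global, temperature)
    # Fine-grained term: mean-pool region/token features (an assumed choice)
    # so detailed visual content is pulled toward the text semantic space.
    fine = info_nce(img_regions.mean(dim=1), txt_tokens.mean(dim=1), temperature)
    return alpha * coarse + (1 - alpha) * fine

# Example usage with random features.
B, R, T, D = 8, 36, 50, 256
loss = dual_granularity_loss(torch.randn(B, D), torch.randn(B, D),
                             torch.randn(B, R, D), torch.randn(B, T, D))
```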
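Similarly, the RCF procedure is only described at a high level. One plausible reading, sketched below under stated assumptions, is that each region's label comes from a factual-vs-counterfactual comparison: score the summary with all regions, then with one region masked, and label the region beneficial if its removal hurts the score; a lightweight classifier is then trained on these labels. The scorer `score_fn`, the `RegionSelector` architecture, and the labeling rule are all hypothetical.

```python
# Sketch of counterfactual region labeling and selector training
# (assumptions throughout; not the paper's RCF).
import torch
import torch.nn as nn

@torch.no_grad()
def counterfactual_labels(score_fn, text, regions):
    """For each region, compare the summary quality score using the full
    region set (factual) against the score with that region masked out
    (counterfactual). Label 1 (beneficial) if removal lowers the score.
    `score_fn` is an assumed black-box scorer, e.g. ROUGE of the summary."""
    full = score_fn(text, regions)
    labels = []
    for i in range(regions.size(0)):
        masked = torch.cat([regions[:i], regions[i + 1:]])  # counterfactual
        labels.append(1.0 if score_fn(text, masked) < full else 0.0)
    return torch.tensor(labels)

class RegionSelector(nn.Module):
    """Binary classifier trained on counterfactual labels to keep only
    regions that help the summary, avoiding over-filtering or over-retention."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, regions):                    # (R, D) -> (R,)
        return self.mlp(regions).squeeze(-1)

# One training step on a single example; score_fn and features are placeholders.
selector = RegionSelector()
regions = torch.randn(36, 256)
labels = counterfactual_labels(lambda t, r: r.sum().item(), "doc", regions)
loss = nn.functional.binary_cross_entropy_with_logits(selector(regions), labels)
```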