
Latest articles from Computer Vision and Image Understanding

Transformer tracking with high-low frequency attention
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-25 DOI: 10.1016/j.cviu.2025.104563
Zhi Chen, Zhen Yu
Transformer-based trackers have achieved impressive performance due to their powerful global modeling capability. However, most existing methods employ vanilla attention modules, which treat template and search regions homogeneously and overlook the distinct characteristics of different frequency features—high-frequency components capture local details critical for target identification, while low-frequency components provide global structural context. To bridge this gap, we propose a novel Transformer architecture with High-low (Hi–Lo) frequency attention for visual object tracking. Specifically, a high-frequency attention module is applied to the template region to preserve fine-grained target details. Conversely, a low-frequency attention module processes the search region to efficiently capture global dependencies with reduced computational cost. Furthermore, we introduce a Global–Local Dual Interaction (GLDI) module to establish reciprocal feature enhancement between the template and search feature maps, effectively integrating multi-frequency information. Extensive experiments on six challenging benchmarks (LaSOT, GOT-10k, TrackingNet, UAV123, OTB100, and NFS) demonstrate that our method, named HiLoTT, achieves state-of-the-art performance while maintaining a real-time speed of 45 frames per second.
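The abstract does not include an implementation; below is a minimal PyTorch sketch of what a Hi-Lo frequency split could look like, with a depthwise-convolution branch standing in for high-frequency (template) attention and attention over pooled keys/values standing in for low-frequency (search) attention. All module names and hyperparameters are illustrative assumptions, not the authors' HiLoTT code.

```python
# Hypothetical Hi-Lo attention sketch; not the published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqAttention(nn.Module):
    """Emphasizes local detail: depthwise conv over the template feature map."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):              # x: (B, C, H, W) template features
        return x + self.proj(self.dwconv(x))

class LowFreqAttention(nn.Module):
    """Captures global context cheaply: attention over low-pass pooled keys/values."""
    def __init__(self, dim, num_heads=4, pool=2):
        super().__init__()
        self.pool = pool
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):              # x: (B, C, H, W) search-region features
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)                  # (B, HW, C) queries
        kv = F.avg_pool2d(x, self.pool)                   # low-pass the keys/values
        kv = kv.flatten(2).transpose(1, 2)                # (B, hw, C)
        out, _ = self.attn(q, kv, kv)
        return (q + out).transpose(1, 2).reshape(B, C, H, W)

template = torch.randn(1, 256, 8, 8)
search = torch.randn(1, 256, 16, 16)
z = HighFreqAttention(256)(template)   # fine-grained target details
s = LowFreqAttention(256)(search)      # global dependencies at reduced cost
```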
Citations: 0
Multimodal driver behavior recognition based on frame-adaptive convolution and feature fusion
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-03 DOI: 10.1016/j.cviu.2025.104587
Jiafeng Li, Jiajun Sun, Ziqing Li, Jing Zhang, Li Zhuo
The identification of driver behavior plays a vital role in the autonomous driving systems of intelligent vehicles. However, the complexity of real-world driving scenarios presents significant challenges. Several existing approaches struggle to effectively exploit multimodal feature-level fusion and suffer from suboptimal temporal modeling, resulting in unsatisfactory performance. We introduce a new multimodal framework that combines RGB frames with skeletal data at the feature level, incorporating a frame-adaptive convolution mechanism to improve temporal modeling. Specifically, we first propose the local spatial attention enhancement module (LSAEM). This module refines RGB features using local spatial attention from skeletal features, prioritizing critical local regions and mitigating the negative effects of complex backgrounds in the RGB modality. Next, we introduce the heatmap enhancement module (HEM), which enriches skeletal features with contextual scene information from RGB heatmaps, thus addressing the lack of local scene context in skeletal data. Finally, we propose a frame-adaptive convolution mechanism that dynamically adjusts convolutional weights per frame, emphasizing key temporal frames and further strengthening the model’s temporal modeling capabilities. Extensive experiments on the Drive&Act dataset validate the efficacy of the presented approach, showing remarkable enhancements in recognition accuracy as compared to existing SOTA methods.
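As a rough illustration of the frame-adaptive convolution idea (a shared kernel modulated per frame), a hedged PyTorch sketch follows; the calibration head, shapes, and names are assumptions rather than the authors' design.

```python
# Hypothetical frame-adaptive convolution: a shared 2D kernel whose effect is
# modulated per frame by a calibration vector predicted from that frame (scaling the
# output channels is equivalent to scaling the kernel, since convolution is linear).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameAdaptiveConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.calib = nn.Sequential(                 # per-frame calibration from pooled features
            nn.Linear(in_ch, in_ch // 2), nn.ReLU(),
            nn.Linear(in_ch // 2, out_ch), nn.Sigmoid())
        self.pad = k // 2

    def forward(self, x):                           # x: (B, T, C, H, W) video features
        B, T, C, H, W = x.shape
        alpha = self.calib(x.mean(dim=(3, 4)))      # (B, T, out_ch) per-frame channel scales
        outs = []
        for t in range(T):
            y = F.conv2d(x[:, t], self.weight, padding=self.pad)   # shared kernel
            outs.append(y * alpha[:, t].view(B, -1, 1, 1))         # frame-specific modulation
        return torch.stack(outs, dim=1)             # (B, T, out_ch, H, W)

clip = torch.randn(2, 8, 64, 56, 56)                # a batch of two 8-frame clips
out = FrameAdaptiveConv(64, 128)(clip)
```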
Citations: 0
Temporal prompt guided visual–text–object alignment for zero-shot video captioning
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-12 DOI: 10.1016/j.cviu.2025.104601
Ping Li, Tao Wang, Zeyu Pan
Video captioning generates a descriptive sentence for a video. Existing methods rely on a large number of annotated captions for training the model, but collecting so many captions is usually very expensive. This raises the challenge of how to generate video captions from unpaired videos and sentences, i.e., zero-shot video captioning. While some progress using Large Language Models (LLMs) has been made in zero-shot image captioning, such methods still fail to consider the temporal relations in the video domain. Directly adapting LLM-based image methods to video may therefore easily produce incorrect verbs and nouns in the generated sentences. To address this problem, we propose the Temporal Prompt guided Visual–text–object Alignment (TPVA) approach for zero-shot video captioning. It consists of the temporal prompt guidance module and the visual–text–object alignment module. The former employs a pre-trained action recognition model to yield the action class as the key word of the temporal prompt, which guides the LLM to generate a text phrase containing the verb identifying the action. The latter implements visual–text alignment and text–object alignment by computing their respective similarity scores, which allows the model to generate words that better reveal the video semantics. Experimental results on several benchmarks demonstrate the superiority of the proposed method in zero-shot video captioning. Code is available at https://github.com/mlvccn/TPVA_VidCap_ZeroShot.
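The candidate-scoring step can be pictured as below: a hedged sketch that combines a visual–text similarity and a text–object similarity to rank caption candidates, using random placeholder embeddings in place of a CLIP-style encoder. It is not the authors' TPVA code, and the weighting scheme is an assumption.

```python
# Hypothetical candidate scoring for visual-text-object alignment.
import torch
import torch.nn.functional as F

def score_candidates(cand_emb, frame_emb, obj_emb, w_vis=0.5, w_obj=0.5):
    """cand_emb: (N, D) caption candidates, frame_emb: (T, D) video frames,
    obj_emb: (K, D) detected object labels. Returns one score per candidate."""
    cand = F.normalize(cand_emb, dim=-1)
    frames = F.normalize(frame_emb, dim=-1)
    objs = F.normalize(obj_emb, dim=-1)
    vis_sim = (cand @ frames.T).mean(dim=1)          # visual-text alignment
    obj_sim = (cand @ objs.T).max(dim=1).values      # text-object alignment
    return w_vis * vis_sim + w_obj * obj_sim

cands, frames, objs = torch.randn(5, 512), torch.randn(8, 512), torch.randn(3, 512)
best = score_candidates(cands, frames, objs).argmax()
print(f"best candidate index: {best.item()}")
```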
Citations: 0
Evaluating the effect of image quantity on Gaussian Splatting: A statistical perspective
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-25 DOI: 10.1016/j.cviu.2025.104575
Anurag Dalal, Daniel Hagen, Kjell Gunnar Robbersmyr, Kristian Muri Knausgård
3D reconstruction is now a key capability in computer vision. With the advancements in NeRFs and Gaussian Splatting, there is an increasing need to properly capture data to feed these algorithms and use them in real-world scenarios. Most publicly available datasets that can be used for Gaussian Splatting are not suitable for a proper statistical analysis of the effect of reducing the number of cameras, or of uniformly placed versus randomly placed cameras. The number of cameras in the scene significantly affects the accuracy and resolution of the final 3D reconstruction. Thus, designing a proper data capture system with a certain number of cameras is crucial for 3D reconstruction. In this paper the UnrealGaussianStat dataset is introduced, and a statistical analysis is performed on the effect that decreasing the number of viewpoints has on Gaussian Splatting. It is found that once the number of cameras exceeds 100, the train and test metrics saturate and additional cameras have no significant impact on reconstruction quality.
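The reported saturation behavior suggests an analysis of roughly the following shape; this sketch uses made-up placeholder PSNR values purely to illustrate how a saturation point might be located from a camera-count sweep, and is not the paper's actual analysis code (a real run would retrain Gaussian Splatting per camera subset).

```python
# Hypothetical saturation check over a camera-count sweep; PSNR values are placeholders.
import numpy as np

camera_counts = np.array([10, 25, 50, 75, 100, 150, 200, 300])
psnr = np.array([21.3, 24.8, 27.1, 28.4, 29.2, 29.4, 29.5, 29.5])  # placeholder values

def saturation_point(xs, ys, eps=0.3):
    """First camera count whose metric is within eps of the best observed value."""
    best = ys.max()
    idx = np.argmax(ys >= best - eps)   # index of the first near-plateau measurement
    return xs[idx]

print("metric saturates around", saturation_point(camera_counts, psnr), "cameras")
```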
Citations: 0
CLAFusion: Misaligned infrared and visible image fusion based on contrastive learning and collaborative attention
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-20 DOI: 10.1016/j.cviu.2025.104574
Linli Ma, Suzhen Lin, Jianchao Zeng, Yanbo Wang, Zanxia Jin
Due to differences in imaging principles and shooting positions, achieving strict spatial alignment between images from different sensors is challenging. Existing fusion methods often introduce artifacts in the fusion results when there are slight shifts or deformations between source images. Although joint training schemes of registration and fusion improve fusion results through the feedback of fusion on registration, they still face the challenges of unstable registration accuracy and artifacts caused by local non-rigid distortions. To address this, we propose a new misaligned infrared and visible image fusion method, named CLAFusion. It introduces a contrastive learning-based multi-scale feature extraction module (CLMFE) to enhance the similarity between images of different modalities from the same scene and increase the differences between images from different scenes, improving the stability of registration accuracy. Meanwhile, a collaborative attention fusion module (CAFM) is designed to combine window attention, gradient channel attention, and the feedback of fusion on registration to realize precise alignment of features and suppression of misaligned redundant features, alleviating artifacts in the fusion results. Extensive experiments show that the proposed method outperforms state-of-the-art methods in misaligned image fusion and semantic segmentation.
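The contrastive objective implied by CLMFE can be sketched as a standard cross-modal InfoNCE loss, as below: paired infrared/visible features of the same scene are pulled together and features of different scenes pushed apart. Temperature, shapes, and names are assumptions, not the authors' implementation.

```python
# Hypothetical cross-modal InfoNCE loss for paired infrared / visible features.
import torch
import torch.nn.functional as F

def cross_modal_infonce(ir_feat, vis_feat, temperature=0.07):
    """ir_feat, vis_feat: (B, D) pooled features; row i of each comes from the same scene."""
    ir = F.normalize(ir_feat, dim=-1)
    vis = F.normalize(vis_feat, dim=-1)
    logits = ir @ vis.T / temperature              # (B, B) scene-to-scene similarities
    targets = torch.arange(ir.size(0))             # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = cross_modal_infonce(torch.randn(8, 256), torch.randn(8, 256))
```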
Citations: 0
SCAFNet: Multimodal stroke medical image synthesis and fusion network based on self attention and cross attention
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-18 DOI: 10.1016/j.cviu.2025.104611
Yu Zhu, Liqiang Song, Junli Zhao, Guodong Wang, Hui Li, Yi Li
Early diagnosis and intervention are critical in managing acute ischemic stroke to effectively reduce morbidity and mortality. Medical image synthesis generates multimodal images from unimodal inputs, while image fusion integrates complementary information across modalities. However, current approaches typically address these tasks separately, neglecting their inherent synergies and the potential for a richer, more comprehensive diagnostic picture. To overcome this, we propose a two-stage deep learning (DL) framework for improved lesion analysis in ischemic stroke, which combines medical image synthesis and fusion to improve diagnostic informativeness. In the first stage, a Generative Adversarial Network (GAN)-based method, pix2pixHD, efficiently synthesizes high-fidelity multimodal medical images from unimodal inputs, thereby enriching the available diagnostic data for subsequent processing. The second stage introduces a multimodal medical image fusion network, SCAFNet, leveraging self-attention and cross-attention mechanisms. SCAFNet captures intra-modal feature relationships via self-attention to emphasize key information within each modality, and constructs inter-modal feature interactions via cross-attention to fully exploit their complementarity. Additionally, an Information Assistance Module (IAM) is introduced to facilitate the extraction of more meaningful information and improve the visual quality of fused images. Experimental results demonstrate that the proposed framework significantly outperforms existing methods in both generated and fused image quality, highlighting its substantial potential for clinical applications in medical image analysis.
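A hedged sketch of a self-attention plus cross-attention fusion block in the spirit of SCAFNet follows: each modality first attends to itself, then queries the other. Module names and sizes are assumptions, not the published architecture.

```python
# Hypothetical self-attention + cross-attention fusion block for two modalities.
import torch
import torch.nn as nn

class SelfCrossFusion(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a, b):                       # a, b: (B, N, D) tokens per modality
        a = a + self.self_a(a, a, a)[0]            # intra-modal relationships
        b = b + self.self_b(b, b, b)[0]
        a2 = a + self.cross_ab(a, b, b)[0]         # a queries complementary info in b
        b2 = b + self.cross_ba(b, a, a)[0]         # b queries complementary info in a
        return self.norm(torch.cat([a2, b2], dim=1))  # fused token set

fused = SelfCrossFusion(128)(torch.randn(1, 64, 128), torch.randn(1, 64, 128))
```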
Citations: 0
GL2T-Diff: Medical image translation via spatial-frequency fusion diffusion models
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-27 DOI: 10.1016/j.cviu.2025.104586
Dong Sui, Nanting Song, Xiao Tian, Han Zhou, Yacong Li, Maozu Guo, Kuanquan Wang, Gongning Luo
Diffusion Probabilistic Models (DPMs) are effective in medical image translation (MIT), but they tend to lose high-frequency details during the noise addition process, making it challenging to recover these details during the denoising process. This hinders the model’s ability to accurately preserve anatomical details during MIT tasks, which may ultimately affect the accuracy of diagnostic outcomes. To address this issue, we propose a diffusion model (GL2T-Diff) based on convolutional channel and Laplacian frequency attention mechanisms, which is designed to enhance MIT tasks by effectively preserving critical image features. We introduce two novel modules: the Global Channel Correlation Attention Module (GC2A Module) and the Laplacian Frequency Attention Module (LFA Module). The GC2A Module enhances the model’s ability to capture global dependencies between channels, while the LFA Module effectively retains high-frequency components, which are crucial for preserving anatomical structures. To leverage the complementary strengths of both GC2A Module and LFA Module, we propose the Laplacian Convolutional Attention with Phase-Amplitude Fusion (FusLCA), which facilitates effective integration of spatial and frequency domain features. Experimental results show that GL2T-Diff outperforms state-of-the-art (SOTA) methods, including those based on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and other DPMs, across the BraTS-2021/2024, IXI, and Pelvic datasets. The code is available at https://github.com/puzzlesong8277/GL2T-Diff.
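A minimal sketch of a Laplacian-style frequency attention is shown below: the high-frequency residual (feature minus its low-pass version) is turned into a spatial gate that re-emphasizes fine structures. The module name, kernel sizes, and gating head are illustrative assumptions, not the authors' LFA Module.

```python
# Hypothetical Laplacian-style frequency attention over a feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianFreqAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, x):                               # x: (B, C, H, W)
        low = F.avg_pool2d(x, 3, stride=1, padding=1)   # cheap low-pass approximation
        high = x - low                                  # Laplacian-like high-frequency residual
        attn = torch.sigmoid(self.gate(high))           # (B, 1, H, W) spatial attention
        return x + high * attn                          # boost fine anatomical detail

out = LaplacianFreqAttention(32)(torch.randn(2, 32, 64, 64))
```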
Citations: 0
Spatio-temporal transformers for action unit classification with event cameras
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-26 DOI: 10.1016/j.cviu.2025.104578
Luca Cultrera, Federico Becattini, Lorenzo Berlincioni, Claudio Ferrari, Alberto Del Bimbo
Facial analysis plays a vital role in assistive technologies aimed at improving human–computer interaction, emotional well-being, and non-verbal communication monitoring. For more fine-grained tasks, however, standard sensors might not be adequate due to their latency, making it impossible to record and detect the micro-movements that carry a highly informative signal and that are necessary for inferring the true emotions of a subject. Event cameras have been gaining increasing interest as a possible solution to this and similar high-frame-rate tasks. In this paper we propose a novel spatio-temporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered a major cause of the existing gap between the maturity of RGB and neuromorphic vision models. In fact, gathering data is harder in the event domain since it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of both RGB videos and event streams. The dataset is annotated at a video level with facial Action Units and also contains streams collected with a variety of possible applications in mind, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision, bridging the domain gap by representing face shapes in a 3D space. This makes our model suitable for real-world assistive scenarios, including privacy-preserving wearable systems and responsive social interaction monitoring. Our proposed model outperforms baseline methods by capturing spatial and temporal information, crucial for recognizing subtle facial micro-expressions.
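SPT and LSA are known from the small-dataset ViT literature; the sketch below follows those usual definitions (shifted input copies concatenated before patch embedding, and self-attention with a learnable temperature and a masked diagonal) and should not be read as the authors' exact model. All sizes and names are illustrative assumptions.

```python
# Hypothetical Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA).
import torch
import torch.nn as nn

class ShiftedPatchTokenizer(nn.Module):
    def __init__(self, in_ch=3, dim=192, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch * 5, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, C, H, W)
        s = x.shape[-1] // 16                      # shift by roughly half a patch
        shifts = [(s, s), (-s, s), (s, -s), (-s, -s)]
        views = [x] + [torch.roll(x, sh, dims=(2, 3)) for sh in shifts]
        return self.proj(torch.cat(views, dim=1)).flatten(2).transpose(1, 2)  # (B, N, D)

class LocalitySelfAttention(nn.Module):
    def __init__(self, dim, heads=3):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.tau = nn.Parameter(torch.tensor((dim // heads) ** -0.5))  # learnable temperature

    def forward(self, x):                          # x: (B, N, D)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.tau
        attn = attn.masked_fill(torch.eye(N, dtype=torch.bool), float("-inf"))  # drop self-tokens
        out = attn.softmax(dim=-1) @ v
        return out.transpose(1, 2).reshape(B, N, D)

tokens = ShiftedPatchTokenizer()(torch.randn(1, 3, 224, 224))
y = LocalitySelfAttention(192)(tokens)
```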
Citations: 0
What2Keep: A communication-efficient collaborative perception framework for 3D detection via keeping valuable information
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-26 DOI: 10.1016/j.cviu.2025.104572
Hongkun Zhang, Yan Wu, Zhengbin Zhang
Collaborative perception has attracted significant attention in autonomous driving, as the ability to share information among Connected Autonomous Vehicles (CAVs) substantially enhances perception performance. However, collaborative perception faces critical challenges, among which limited communication bandwidth remains a fundamental bottleneck due to inherent constraints in current communication technologies. Bandwidth limitations can severely degrade the transmitted information, leading to a sharp decline in perception performance. To address this issue, we propose What To Keep (What2Keep), a collaborative perception framework that dynamically adapts to communication bandwidth fluctuations. Our method aims to establish a consensus between vehicles, prioritizing the transmission of the intermediate features that are most critical to the ego vehicle. The proposed framework offers two key advantages: (1) the consensus-based feature selection mechanism effectively incorporates different collaborative patterns as prior knowledge to help vehicles preserve the most valuable features, improving communication efficiency and enhancing model robustness against communication degradation; and (2) What2Keep employs a cross-vehicle fusion strategy that effectively aggregates cooperative perception information while exhibiting robustness against varying communication volume. Extensive experiments have demonstrated the superior performance of our method on the OPV2V and V2XSet benchmarks, achieving state-of-the-art AP scores of 83.57% and 77.78% respectively while maintaining approximately 20% relative improvement under severe bandwidth constraints (2^14 B). Our qualitative experiments explain the working mechanism of What2Keep. Code will be available at https://github.com/CHAMELENON/What2Keep.
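The bandwidth-constrained selection idea can be sketched as a top-k choice over scored feature-map cells, as below: each agent keeps only the cells that fit the communication budget and transmits their values plus indices. The scoring head and budget handling are assumptions, not the paper's consensus mechanism.

```python
# Hypothetical budget-aware feature selection for collaborative perception.
import torch
import torch.nn as nn

class BudgetedFeatureSelector(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Conv2d(dim, 1, kernel_size=1)    # per-cell importance score

    def forward(self, feat, budget_cells):               # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        scores = self.score(feat).flatten(1)              # (B, H*W)
        k = min(budget_cells, H * W)
        idx = scores.topk(k, dim=1).indices               # keep the k most valuable cells
        flat = feat.flatten(2)                            # (B, C, H*W)
        kept = torch.gather(flat, 2, idx.unsqueeze(1).expand(B, C, k))
        return kept, idx                                  # what gets transmitted

feat = torch.randn(1, 256, 48, 176)                       # a BEV feature map
kept, idx = BudgetedFeatureSelector(256)(feat, budget_cells=512)
print(kept.shape)                                         # (1, 256, 512)
```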
Citations: 0
A dynamic hybrid network with attention and mamba for image captioning
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-18 DOI: 10.1016/j.cviu.2025.104617
Lulu Wang, Ruiji Xue, Zhengtao Yu, Ruoyu Zhang, Tongling Pan, Yingna Li
Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging the vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes the emphasis on long-range dependencies. Such remote elements nevertheless play a critical role in yielding captions of superior quality. Therefore, we pursue strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, and ultimately propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) that integrates attention mechanisms with Mamba blocks, which complements attention by leveraging Mamba's superior long-sequence modeling capabilities and enables a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to the evolving decoding context, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. The implementation code is accessible via https://github.com/simple-boy/DH-Net.
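A hedged sketch of a hybrid attention-plus-state-space encoder layer follows. The GatedScan branch below is a toy stand-in for a real Mamba block (e.g., one from the mamba_ssm package), used only to illustrate the global-attention / long-range-recurrence split; it is not the authors' DH-Net code.

```python
# Hypothetical hybrid encoder layer: global self-attention paired with a sequential,
# input-gated recurrence (Mamba-like in spirit only).
import torch
import torch.nn as nn

class GatedScan(nn.Module):
    """Toy linear recurrence with an input-dependent gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.inp = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, D)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):              # sequential scan over tokens
            a = torch.sigmoid(self.gate(x[:, t]))
            h = a * h + (1 - a) * self.inp(x[:, t])
            outs.append(h)
        return torch.stack(outs, dim=1)

class HybridEncoderLayer(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scan = GatedScan(dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, D) grid/region features
        n = self.norm1(x)
        x = x + self.attn(n, n, n)[0]           # global context via attention
        x = x + self.scan(self.norm2(x))        # long-sequence branch
        return x

y = HybridEncoderLayer(512)(torch.randn(2, 49, 512))
```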
Citations: 0