Pub Date: 2026-01-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.cviu.2025.104563
Zhi Chen, Zhen Yu
Transformer-based trackers have achieved impressive performance due to their powerful global modeling capability. However, most existing methods employ vanilla attention modules, which treat template and search regions homogeneously and overlook the distinct characteristics of different frequency features—high-frequency components capture local details critical for target identification, while low-frequency components provide global structural context. To bridge this gap, we propose a novel Transformer architecture with High-low (Hi–Lo) frequency attention for visual object tracking. Specifically, a high-frequency attention module is applied to the template region to preserve fine-grained target details. Conversely, a low-frequency attention module processes the search region to efficiently capture global dependencies with reduced computational cost. Furthermore, we introduce a Global–Local Dual Interaction (GLDI) module to establish reciprocal feature enhancement between the template and search feature maps, effectively integrating multi-frequency information. Extensive experiments on six challenging benchmarks (LaSOT, GOT-10k, TrackingNet, UAV123, OTB100, and NFS) demonstrate that our method, named HiLoTT, achieves state-of-the-art performance while maintaining a real-time speed of 45 frames per second.
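To make the Hi–Lo split concrete, here is a minimal numpy sketch of the general idea (not the authors' code; all function names, window sizes, and shapes are illustrative assumptions): high-frequency attention operates within small local windows to preserve detail, while low-frequency attention lets full-resolution queries attend to average-pooled keys/values, reducing the quadratic cost on the larger search region.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hi_freq_attention(x, window=4):
    """Local window self-attention over a (N, C) template feature map."""
    n, c = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):          # attend only inside each window
        w = x[start:start + window]
        attn = softmax(w @ w.T / np.sqrt(c))
        out[start:start + window] = attn @ w
    return out

def lo_freq_attention(x, pool=4):
    """Global attention with average-pooled keys/values on a (N, C) search map."""
    n, c = x.shape
    kv = x.reshape(n // pool, pool, c).mean(axis=1)  # downsampled K/V: fewer tokens
    attn = softmax(x @ kv.T / np.sqrt(c))            # queries stay full resolution
    return attn @ kv

rng = np.random.default_rng(0)
template = rng.normal(size=(16, 8))   # e.g. 4x4 template tokens, 8 channels
search = rng.normal(size=(64, 8))     # e.g. 8x8 search tokens
print(hi_freq_attention(template).shape, lo_freq_attention(search).shape)
```

With `pool=4`, the low-frequency branch attends to 16 pooled tokens instead of 64, which is where the computational saving described above comes from.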
Title: Transformer tracking with high-low frequency attention. Computer Vision and Image Understanding, vol. 263, Article 104563.
Pub Date: 2026-01-01 | Epub Date: 2025-12-03 | DOI: 10.1016/j.cviu.2025.104587
Jiafeng Li, Jiajun Sun, Ziqing Li, Jing Zhang, Li Zhuo
The identification of driver behavior plays a vital role in the autonomous driving systems of intelligent vehicles. However, the complexity of real-world driving scenarios presents significant challenges. Several existing approaches struggle to effectively exploit multimodal feature-level fusion and suffer from suboptimal temporal modeling, resulting in unsatisfactory performance. We introduce a new multimodal framework that combines RGB frames with skeletal data at the feature level, incorporating a frame-adaptive convolution mechanism to improve temporal modeling. Specifically, we first propose the local spatial attention enhancement module (LSAEM). This module refines RGB features using local spatial attention from skeletal features, prioritizing critical local regions and mitigating the negative effects of complex backgrounds in the RGB modality. Next, we introduce the heatmap enhancement module (HEM), which enriches skeletal features with contextual scene information from RGB heatmaps, thus addressing the lack of local scene context in skeletal data. Finally, we propose a frame-adaptive convolution mechanism that dynamically adjusts convolutional weights per frame, emphasizing key temporal frames and further strengthening the model’s temporal modeling capabilities. Extensive experiments on the Drive&Act dataset validate the efficacy of the presented approach, showing remarkable enhancements in recognition accuracy as compared to existing SOTA methods.
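The frame-adaptive convolution idea can be sketched as follows (an illustrative interpretation, not the authors' implementation; the gating function and shapes are assumptions): a lightweight gate predicts a per-frame scalar that rescales a shared temporal kernel, so informative frames contribute more to the temporal aggregation.

```python
import numpy as np

def frame_adaptive_conv(features, kernel, gate_w):
    """features: (T, C) per-frame features; kernel: (K, C) shared temporal kernel;
    gate_w: (C,) gate weights producing one scalar gate per frame."""
    T, C = features.shape
    K = kernel.shape[0]
    gates = 1.0 / (1.0 + np.exp(-(features @ gate_w)))  # sigmoid gate, shape (T,)
    pad = K // 2
    padded = np.pad(features, ((pad, pad), (0, 0)))
    out = np.zeros((T, C))
    for t in range(T):
        # the shared kernel is rescaled by this frame's gate before applying
        w_t = kernel * gates[t]
        out[t] = (padded[t:t + K] * w_t).sum(axis=0)
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))        # 8 frames, 4 channels
k = rng.normal(size=(3, 4))        # temporal kernel of size 3
y = frame_adaptive_conv(x, k, rng.normal(size=4))
print(y.shape)  # (8, 4)
```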
Title: Multimodal driver behavior recognition based on frame-adaptive convolution and feature fusion. Computer Vision and Image Understanding, vol. 263, Article 104587.
Pub Date: 2026-01-01 | Epub Date: 2025-12-12 | DOI: 10.1016/j.cviu.2025.104601
Ping Li, Tao Wang, Zeyu Pan
Video captioning generates a descriptive sentence for a video. Existing methods rely on plentiful annotated captions to train the model, but collecting so many captions is usually very expensive. This raises the challenge of how to generate video captions from unpaired videos and sentences, i.e., zero-shot video captioning. While some progress has been made in zero-shot image captioning using Large Language Models (LLMs), such methods still fail to consider temporal relations in the video domain. Directly adapting LLM-based image methods to video may therefore easily produce incorrect verbs and nouns in the generated sentences. To address this problem, we propose the Temporal Prompt guided Visual–text–object Alignment (TPVA) approach for zero-shot video captioning. It consists of a temporal prompt guidance module and a visual–text–object alignment module. The former employs a pre-trained action recognition model to yield the action class as the key word of the temporal prompt, which guides the LLM to generate a text phrase containing the verb identifying the action. The latter implements both visual–text alignment and text–object alignment by computing their respective similarity scores, which allows the model to generate words that better reveal the video semantics. Experimental results on several benchmarks demonstrate the superiority of the proposed method in zero-shot video captioning. Code is available at https://github.com/mlvccn/TPVA_VidCap_ZeroShot.
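The similarity-score alignment can be illustrated with a minimal sketch (the exact scoring scheme and embedding dimensions here are assumptions, not taken from the paper): candidate words are ranked by cosine similarity between their text embeddings and the video's visual embedding.

```python
import numpy as np

def cosine_scores(visual, text_bank):
    """visual: (D,) video embedding; text_bank: (N, D) candidate word embeddings.
    Returns one cosine similarity per candidate."""
    v = visual / np.linalg.norm(visual)
    t = text_bank / np.linalg.norm(text_bank, axis=1, keepdims=True)
    return t @ v

rng = np.random.default_rng(2)
video_emb = rng.normal(size=16)            # hypothetical visual embedding
candidates = rng.normal(size=(5, 16))      # hypothetical candidate word embeddings
scores = cosine_scores(video_emb, candidates)
best = int(np.argmax(scores))              # index of the best-aligned candidate
print(scores.shape, best)
```

In the paper's setting, the analogous scores for visual–text and text–object pairs would steer the LLM toward words consistent with the video content.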
Title: Temporal prompt guided visual–text–object alignment for zero-shot video captioning. Computer Vision and Image Understanding, vol. 263, Article 104601.
Pub Date: 2026-01-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.cviu.2025.104575
Anurag Dalal, Daniel Hagen, Kjell Gunnar Robbersmyr, Kristian Muri Knausgård
3D reconstruction is now a key capability in computer vision. With advances in NeRFs and Gaussian Splatting, there is a growing need to capture data properly for these algorithms and to use them in real-world scenarios. Most publicly available datasets suitable for Gaussian Splatting do not support proper statistical analysis of reducing the number of cameras, or of the effect of uniformly placed versus randomly placed cameras. The number of cameras in the scene significantly affects the accuracy and resolution of the final 3D reconstruction, so designing a proper data capture system with an appropriate number of cameras is crucial. In this paper, the UnrealGaussianStat dataset is introduced, and a statistical analysis is performed on the effect that decreasing the number of viewpoints has on Gaussian Splatting. It is found that once the number of cameras exceeds 100, the train and test metrics saturate, and additional cameras have no significant impact on reconstruction quality.
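A saturation analysis of this kind can be sketched numerically (the curve below is synthetic and the threshold is an arbitrary assumption, chosen only to illustrate the procedure): compute the marginal quality gain between successive camera counts and report the count after which gains become negligible.

```python
import numpy as np

def saturation_point(counts, psnr, min_gain=0.1):
    """Return the first camera count after which the PSNR gain per step
    drops below min_gain dB."""
    gains = np.diff(psnr)
    for i, g in enumerate(gains):
        if g < min_gain:
            return counts[i]
    return counts[-1]

counts = np.array([25, 50, 75, 100, 125, 150])
psnr = np.array([24.0, 27.5, 29.0, 29.8, 29.85, 29.88])  # made-up quality curve
print(saturation_point(counts, psnr))  # 100 with this toy data
```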
Title: Evaluating the effect of image quantity on Gaussian Splatting: A statistical perspective. Computer Vision and Image Understanding, vol. 263, Article 104575.
Pub Date: 2026-01-01 | Epub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104574
Linli Ma, Suzhen Lin, Jianchao Zeng, Yanbo Wang, Zanxia Jin
Due to differences in imaging principles and shooting positions, achieving strict spatial alignment between images from different sensors is challenging. Existing fusion methods often introduce artifacts into fusion results when there are slight shifts or deformations between source images. Although joint training schemes for registration and fusion improve fusion results through the feedback of fusion on registration, they still face the challenges of unstable registration accuracy and artifacts caused by local non-rigid distortions. To this end, we propose a new misaligned infrared and visible image fusion method, named CLAFusion. It introduces a contrastive learning-based multi-scale feature extraction module (CLMFE) to enhance the similarity between images of different modalities from the same scene and to increase the differences between images from different scenes, improving the stability of registration accuracy. Meanwhile, a collaborative attention fusion module (CAFM) is designed to combine window attention, gradient channel attention, and the feedback of fusion on registration, realizing precise feature alignment and suppressing misaligned redundant features, thereby alleviating artifacts in fusion results. Extensive experiments show that the proposed method outperforms state-of-the-art methods in misaligned image fusion and semantic segmentation.
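The contrastive objective underlying a module like CLMFE can be sketched with an InfoNCE-style loss (the paper's exact loss is not specified here; the temperature and feature shapes are assumptions): paired infrared/visible features from the same scene are pulled together, while features from different scenes are pushed apart.

```python
import numpy as np

def info_nce(ir_feats, vis_feats, temperature=0.1):
    """ir_feats, vis_feats: (N, D) L2-normalized features; row i of each
    comes from the same scene. Lower loss = better cross-modal matching."""
    sim = ir_feats @ vis_feats.T / temperature        # (N, N) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)             # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # diagonal = matched pairs

rng = np.random.default_rng(3)
f = rng.normal(size=(4, 8))
f /= np.linalg.norm(f, axis=1, keepdims=True)
noisy = f + 0.05 * rng.normal(size=f.shape)           # simulated "other modality"
noisy /= np.linalg.norm(noisy, axis=1, keepdims=True)
# correctly paired features give a much lower loss than shuffled pairs
print(info_nce(f, noisy) < info_nce(f, f[::-1]))
```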
Title: CLAFusion: Misaligned infrared and visible image fusion based on contrastive learning and collaborative attention. Computer Vision and Image Understanding, vol. 263, Article 104574.
Pub Date: 2026-01-01 | Epub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104611
Yu Zhu, Liqiang Song, Junli Zhao, Guodong Wang, Hui Li, Yi Li
Early diagnosis and intervention are critical in managing acute ischemic stroke to effectively reduce morbidity and mortality. Medical image synthesis generates multimodal images from unimodal inputs, while image fusion integrates complementary information across modalities. However, current approaches typically address these tasks separately, neglecting their inherent synergies and the potential for a richer, more comprehensive diagnostic picture. To overcome this, we propose a two-stage deep learning (DL) framework for improved lesion analysis in ischemic stroke, which combines medical image synthesis and fusion to enhance diagnostic informativeness. In the first stage, a Generative Adversarial Network (GAN)-based method, pix2pixHD, efficiently synthesizes high-fidelity multimodal medical images from unimodal inputs, thereby enriching the available diagnostic data for subsequent processing. The second stage introduces a multimodal medical image fusion network, SCAFNet, leveraging self-attention and cross-attention mechanisms. SCAFNet captures intra-modal feature relationships via self-attention to emphasize key information within each modality, and constructs inter-modal feature interactions via cross-attention to fully exploit their complementarity. Additionally, an Information Assistance Module (IAM) is introduced to facilitate the extraction of more meaningful information and improve the visual quality of fused images. Experimental results demonstrate that the proposed framework significantly outperforms existing methods in both generated and fused image quality, highlighting its substantial potential for clinical applications in medical image analysis.
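The cross-attention interaction between modalities can be shown in a minimal sketch (shapes and the CT/MRI naming below are illustrative assumptions, not SCAFNet's actual design): tokens from one modality act as queries against the keys/values of the other, so each modality's features are enriched with complementary information.

```python
import numpy as np

def cross_attention(q_feats, kv_feats):
    """q_feats: (Nq, D) tokens of modality A; kv_feats: (Nk, D) of modality B.
    Returns A's tokens updated with information attended from B."""
    d = q_feats.shape[1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over B's tokens
    return attn @ kv_feats

rng = np.random.default_rng(4)
ct_tokens = rng.normal(size=(6, 8))    # hypothetical tokens from modality A
mri_tokens = rng.normal(size=(10, 8))  # hypothetical tokens from modality B
print(cross_attention(ct_tokens, mri_tokens).shape)  # (6, 8)
```

Self-attention is the special case where queries and keys/values come from the same modality, which is how the intra-modal branch described above differs from this inter-modal one.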
Title: SCAFNet: Multimodal stroke medical image synthesis and fusion network based on self attention and cross attention. Computer Vision and Image Understanding, vol. 263, Article 104611.
Pub Date: 2026-01-01 | Epub Date: 2025-11-27 | DOI: 10.1016/j.cviu.2025.104586
Dong Sui, Nanting Song, Xiao Tian, Han Zhou, Yacong Li, Maozu Guo, Kuanquan Wang, Gongning Luo
Diffusion Probabilistic Models (DPMs) are effective in medical image translation (MIT), but they tend to lose high-frequency details during the noise-addition process, making these details challenging to recover during denoising. This hinders the model’s ability to accurately preserve anatomical details in MIT tasks, which may ultimately affect the accuracy of diagnostic outcomes. To address this issue, we propose a diffusion model (GL2T-Diff) based on convolutional channel and Laplacian frequency attention mechanisms, designed to enhance MIT tasks by effectively preserving critical image features. We introduce two novel modules: the Global Channel Correlation Attention Module (GC2A Module) and the Laplacian Frequency Attention Module (LFA Module). The GC2A Module enhances the model’s ability to capture global dependencies between channels, while the LFA Module effectively retains high-frequency components, which are crucial for preserving anatomical structures. To leverage the complementary strengths of the GC2A and LFA Modules, we propose the Laplacian Convolutional Attention with Phase-Amplitude Fusion (FusLCA), which facilitates effective integration of spatial- and frequency-domain features. Experimental results show that GL2T-Diff outperforms state-of-the-art (SOTA) methods, including those based on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and other DPMs, across the BraTS-2021/2024, IXI, and Pelvic datasets. The code is available at https://github.com/puzzlesong8277/GL2T-Diff.
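The high-frequency extraction that a Laplacian-based module builds on can be illustrated with a plain 3x3 discrete Laplacian (this is a textbook filter, not the LFA Module itself): it responds strongly at edges and fine detail while vanishing in flat regions, which is exactly the content the abstract says diffusion models tend to lose.

```python
import numpy as np

LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def laplacian_filter(img):
    """Apply a 3x3 Laplacian to a 2D image, 'same' size via zero padding."""
    h, w = img.shape
    padded = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + 3, j:j + 3] * LAPLACIAN).sum()
    return out

flat = np.ones((8, 8))                       # flat region: no high-freq content
edge = np.zeros((8, 8)); edge[:, 4:] = 1.0   # a vertical step edge
# interior response: 0.0 on the flat image, 1.0 along the step edge
print(np.abs(laplacian_filter(flat)[1:-1, 1:-1]).max(),
      np.abs(laplacian_filter(edge)[1:-1, 1:-1]).max())
```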
Title: GL2T-Diff: Medical image translation via spatial-frequency fusion diffusion models. Computer Vision and Image Understanding, vol. 263, Article 104586.
Pub Date: 2026-01-01 | Epub Date: 2025-11-26 | DOI: 10.1016/j.cviu.2025.104578
Luca Cultrera, Federico Becattini, Lorenzo Berlincioni, Claudio Ferrari, Alberto Del Bimbo
Facial analysis plays a vital role in assistive technologies aimed at improving human–computer interaction, emotional well-being, and non-verbal communication monitoring. For more fine-grained tasks, however, standard sensors may fall short due to their latency, making it impossible to record and detect the micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been gaining increasing interest as a possible solution to this and similar high-frame-rate tasks. In this paper, we propose a novel spatio-temporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered a major cause of the gap between the maturity of RGB and neuromorphic vision models. Gathering data is indeed harder in the event domain, since it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of both RGB videos and event streams. The dataset is annotated at the video level with facial Action Units and also contains streams collected for a variety of possible applications, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization allows effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision, bridging the domain gap by representing face shapes in a 3D space. This makes our model suitable for real-world assistive scenarios, including privacy-preserving wearable systems and responsive social interaction monitoring.
Our proposed model outperforms baseline methods by capturing spatial and temporal information, crucial for recognizing subtle facial micro-expressions.
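Shifted Patch Tokenization, as commonly described in the Vision Transformer literature, can be sketched as follows (this is a generic sketch, not the authors' code; the shift pattern and patch size are assumptions): diagonally shifted copies of the input are concatenated channel-wise before patch splitting, so each patch token sees a wider spatial context.

```python
import numpy as np

def shifted_patch_tokenize(img, patch=4):
    """img: (H, W, C). Concatenates the image with four diagonally shifted
    copies, then splits into non-overlapping patches.
    Returns (num_patches, patch*patch*5C) tokens."""
    h, w, c = img.shape
    s = patch // 2
    shifts = [(s, s), (s, -s), (-s, s), (-s, -s)]
    stack = [img] + [np.roll(img, sh, axis=(0, 1)) for sh in shifts]
    x = np.concatenate(stack, axis=2)              # (H, W, 5C)
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(x[i:i + patch, j:j + patch].reshape(-1))
    return np.array(tokens)

img = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)   # toy 8x8, 2-channel input
tok = shifted_patch_tokenize(img)
print(tok.shape)  # (4, 160): 4 patches, each 4*4*(5*2) values
```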
Title: Spatio-temporal transformers for action unit classification with event cameras. Computer Vision and Image Understanding, vol. 263, Article 104578.
Pub Date: 2026-01-01
Epub Date: 2025-11-26
DOI: 10.1016/j.cviu.2025.104572
Hongkun Zhang, Yan Wu, Zhengbin Zhang
Collaborative perception has attracted significant attention in autonomous driving, as the ability to share information among Connected Autonomous Vehicles (CAVs) substantially enhances perception performance. However, collaborative perception faces critical challenges, among which limited communication bandwidth remains a fundamental bottleneck due to inherent constraints in current communication technologies. Bandwidth limitations can severely degrade transmitted information, leading to a sharp decline in perception performance. To address this issue, we propose What To Keep (What2Keep), a collaborative perception framework that dynamically adapts to communication bandwidth fluctuations. Our method aims to establish a consensus between vehicles, prioritizing the transmission of intermediate features that are most critical to the ego vehicle. The proposed framework offers two key advantages: (1) the consensus-based feature selection mechanism effectively incorporates different collaborative patterns as prior knowledge to help vehicles preserve the most valuable features, improving communication efficiency and enhancing model robustness against communication degradation; and (2) What2Keep employs a cross-vehicle fusion strategy that effectively aggregates cooperative perception information while exhibiting robustness against varying communication volumes. Extensive experiments demonstrate the superior performance of our method on the OPV2V and V2XSet benchmarks, achieving state-of-the-art AP scores of 83.57% and 77.78%, respectively, while maintaining an approximately 20% relative improvement under severe bandwidth constraints (2^14 B). Our qualitative experiments illustrate the working mechanism of What2Keep. Code will be available at https://github.com/CHAMELENON/What2Keep.
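The abstract does not detail how features are chosen under a bandwidth budget, and the paper's consensus mechanism is certainly more involved than a simple ranking. As a hedged sketch of the general idea, bandwidth-adaptive selection of intermediate features can be illustrated as keeping only the top-scoring spatial cells of a feature map that fit the current budget; the scoring rule and function name below are assumptions for illustration only.

```python
import numpy as np

def select_for_budget(feat, budget_ratio):
    """Bandwidth-aware feature selection sketch (not What2Keep's exact
    consensus mechanism): rank the spatial cells of an intermediate feature
    map by activation magnitude and keep only the fraction that fits the
    current communication budget, zeroing the rest before transmission.

    feat: (C, H, W) feature map; budget_ratio in (0, 1].
    Returns the sparsified map and the boolean keep-mask.
    """
    C, H, W = feat.shape
    scores = np.linalg.norm(feat.reshape(C, -1), axis=0)   # one score per cell
    k = max(1, int(round(budget_ratio * H * W)))           # cells the budget allows
    keep = np.zeros(H * W, dtype=bool)
    keep[np.argsort(scores)[::-1][:k]] = True              # keep top-k cells
    return feat * keep.reshape(1, H, W), keep.reshape(H, W)
```

A receiver-side fusion module would then aggregate whatever sparse maps arrive, which is why robustness to varying communication volume matters.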
{"title":"What2Keep: A communication-efficient collaborative perception framework for 3D detection via keeping valuable information","authors":"Hongkun Zhang, Yan Wu, Zhengbin Zhang","doi":"10.1016/j.cviu.2025.104572","DOIUrl":"10.1016/j.cviu.2025.104572","url":null,"abstract":"<div><div>Collaborative perception has attracted significant attention in autonomous driving, as the ability to share information among Connected Autonomous Vehicles (CAVs) substantially enhances perception performance. However, collaborative perception faces critical challenges, among which limited communication bandwidth remains a fundamental bottleneck due to inherent constraints in current communication technologies. Bandwidth limitations can severely degrade transmitted information, leading to a sharp decline in perception performance. To address this issue, we propose What To Keep (What2Keep), a collaborative perception framework that dynamically adapts to communication bandwidth fluctuations. Our method aims to establish a consensus between vehicles, prioritizing the transmission of intermediate features that are most critical to the ego vehicle. The proposed framework offers two key advantages: (1) the consensus-based feature selection mechanism effectively incorporates different collaborative patterns as prior knowledge to help vehicles preserve the most valuable features, improving communication efficiency and enhancing model robustness against communication degradation; and (2) What2Keep employs a cross-vehicle fusion strategy that effectively aggregates cooperative perception information while exhibiting robustness against varying communication volumes. 
Extensive experiments demonstrate the superior performance of our method on the OPV2V and V2XSet benchmarks, achieving state-of-the-art AP scores of 83.57% and 77.78%, respectively, while maintaining approximately 20% relative improvement under severe bandwidth constraints (<span><math><mrow><msup><mrow><mn>2</mn></mrow><mrow><mn>14</mn></mrow></msup><mtext>B</mtext></mrow></math></span>). Our qualitative experiments illustrate the working mechanism of What2Keep. Code will be available at <span><span>https://github.com/CHAMELENON/What2Keep</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104572"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging the vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes emphasis on long-range dependencies. Such remote elements nevertheless play a critical role in yielding captions of superior quality. Therefore, we pursued strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, and ultimately propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) that integrates attention mechanisms with Mamba blocks, complementing attention with Mamba's superior long-sequence modeling capabilities and enabling a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to evolving decoding contexts, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. The implementation code is accessible via https://github.com/simple-boy/DH-Net.
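The abstract describes the FAM as dynamically adapting feature fusion to the decoding context but gives no formulation. A common way to realize such behavior is a learned sigmoid gate conditioned on the decoder state; the sketch below shows that generic pattern under stated assumptions (the function name, the two-stream split into "local" attention features and "global" Mamba features, and the gate parameterization are all illustrative, not the paper's definition).

```python
import numpy as np

def fam_fuse(h, v_local, v_global, Wg, bg):
    """Context-gated fusion sketch in the spirit of a Feature Aggregation
    Module: the decoder state `h` together with the two feature streams
    produces a per-dimension gate deciding how much local (attention-derived)
    versus global (state-space-derived) context enters the fused vector.
    h, v_local, v_global: (d,) vectors; Wg: (d, 3d); bg: (d,).
    """
    z = np.concatenate([h, v_local, v_global])
    g = 1.0 / (1.0 + np.exp(-(Wg @ z + bg)))   # sigmoid gate in (0, 1)
    return g * v_local + (1.0 - g) * v_global
```

With zero-initialized gate parameters the two streams are averaged; training then lets the gate lean toward whichever stream the current decoding step needs.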
{"title":"A dynamic hybrid network with attention and mamba for image captioning","authors":"Lulu Wang, Ruiji Xue, Zhengtao Yu, Ruoyu Zhang, Tongling Pan, Yingna Li","doi":"10.1016/j.cviu.2025.104617","DOIUrl":"10.1016/j.cviu.2025.104617","url":null,"abstract":"<div><div>Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging the vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes emphasis on long-range dependencies. Such remote elements nevertheless play a critical role in yielding captions of superior quality. Therefore, we pursued strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, and ultimately propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) that integrates attention mechanisms with Mamba blocks, complementing attention with Mamba's superior long-sequence modeling capabilities and enabling a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to evolving decoding contexts, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. 
The implementation code is accessible via <span><span>https://github.com/simple-boy/DH-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104617"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}