Multimodal driver behavior recognition based on frame-adaptive convolution and feature fusion
Pub Date: 2025-12-03 | DOI: 10.1016/j.cviu.2025.104587
Jiafeng Li, Jiajun Sun, Ziqing Li, Jing Zhang, Li Zhuo
Driver behavior recognition plays a vital role in the autonomous driving systems of intelligent vehicles, yet the complexity of real-world driving scenarios poses significant challenges. Many existing approaches fail to exploit multimodal feature-level fusion effectively and suffer from suboptimal temporal modeling, resulting in unsatisfactory performance. We introduce a new multimodal framework that fuses RGB frames with skeletal data at the feature level and incorporates a frame-adaptive convolution mechanism to improve temporal modeling. Specifically, we first propose the local spatial attention enhancement module (LSAEM), which refines RGB features using local spatial attention derived from skeletal features, prioritizing critical local regions and mitigating the negative effects of complex backgrounds in the RGB modality. Next, we introduce the heatmap enhancement module (HEM), which enriches skeletal features with contextual scene information from RGB heatmaps, addressing the lack of local scene context in skeletal data. Finally, we propose a frame-adaptive convolution mechanism that dynamically adjusts convolutional weights per frame, emphasizing key temporal frames and further strengthening the model's temporal modeling capability. Extensive experiments on the Drive&Act dataset validate the efficacy of the proposed approach, which delivers clear gains in recognition accuracy over existing state-of-the-art (SOTA) methods.
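The frame-adaptive convolution described here can be pictured as a temporal convolution whose input is re-weighted by a per-frame importance gate. The PyTorch sketch below illustrates one such design under assumed feature shapes and a hypothetical gating MLP; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrameAdaptiveConv1d(nn.Module):
    """Temporal convolution whose per-frame contribution is re-weighted by a
    lightweight gate predicted from each frame's own features.
    Illustrative sketch only, not the paper's exact design."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        # Per-frame gate: frame feature -> scalar importance in (0, 1)
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):            # x: (B, T, C) frame-level features
        w = self.gate(x)             # (B, T, 1) frame importance
        x = (x * w).transpose(1, 2)  # emphasize key frames -> (B, C, T)
        x = self.conv(x)             # temporal aggregation
        return x.transpose(1, 2)     # back to (B, T, C)

# toy usage: 2 clips, 16 frames, 256-dim features
feats = torch.randn(2, 16, 256)
out = FrameAdaptiveConv1d(256)(feats)   # (2, 16, 256)
```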
{"title":"Multimodal driver behavior recognition based on frame-adaptive convolution and feature fusion","authors":"Jiafeng Li, Jiajun Sun, Ziqing Li, Jing Zhang, Li Zhuo","doi":"10.1016/j.cviu.2025.104587","DOIUrl":"10.1016/j.cviu.2025.104587","url":null,"abstract":"<div><div>The identification of driver behavior plays a vital role in the autonomous driving systems of intelligent vehicles. However, the complexity of real-world driving scenarios presents significant challenges. Several existing approaches struggle to effectively exploit multimodal feature-level fusion and suffer from suboptimal temporal modeling, resulting in unsatisfactory performance. We introduce a new multimodal framework that combines RGB frames with skeletal data at the feature level, incorporating a frame-adaptive convolution mechanism to improve temporal modeling. Specifically, we first propose the local spatial attention enhancement module (LSAEM). This module refines RGB features using local spatial attention from skeletal features, prioritizing critical local regions and mitigating the negative effects of complex backgrounds in the RGB modality. Next, we introduce the heatmap enhancement module (HEM), which enriches skeletal features with contextual scene information from RGB heatmaps, thus addressing the lack of local scene context in skeletal data. Finally, we propose a frame-adaptive convolution mechanism that dynamically adjusts convolutional weights per frame, emphasizing key temporal frames and further strengthening the model’s temporal modeling capabilities. Extensive experiments on the Drive&Act dataset validate the efficacy of the presented approach, showing remarkable enhancements in recognition accuracy as compared to existing SOTA methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104587"},"PeriodicalIF":3.5,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145736968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised multi-modal domain adaptation for RGB-T Semantic Segmentation
Pub Date: 2025-12-03 | DOI: 10.1016/j.cviu.2025.104573
Zeyang Chen, Chunyu Lin, Yao Zhao, Tammam Tillo
This paper proposes an unsupervised multi-modal domain adaptation approach for semantic segmentation of visible and thermal images. The method addresses data scarcity by transferring knowledge from existing semantic segmentation networks, thereby avoiding the high cost of data labeling. We take changes in temperature and light into account to reduce the intra-domain gap between visible and thermal images captured during the day and at night. Additionally, we narrow the inter-domain gap between visible and thermal images using a self-distillation loss. Our approach enables high-quality semantic segmentation without annotations, even under challenging conditions such as nighttime and adverse weather. Experiments conducted on both visible and thermal benchmarks demonstrate the effectiveness of our method, both quantitatively and qualitatively.
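The inter-domain self-distillation mentioned above can be sketched as a KL-divergence consistency term between the segmentation predictions of the two modalities. The snippet below is a minimal PyTorch illustration under assumed logit shapes and a standard knowledge-distillation temperature; the paper's actual loss may differ.

```python
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL-based self-distillation between segmentation logits of two modalities
    (e.g., thermal student vs. visible teacher). Illustrative sketch only.
    logits: (B, num_classes, H, W)."""
    p_t = F.softmax(teacher_logits.detach() / T, dim=1)   # teacher is not updated
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    # KL divergence with the standard T^2 scaling used in knowledge distillation
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```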
{"title":"Unsupervised multi-modal domain adaptation for RGB-T Semantic Segmentation","authors":"Zeyang Chen , Chunyu Lin , Yao Zhao , Tammam Tillo","doi":"10.1016/j.cviu.2025.104573","DOIUrl":"10.1016/j.cviu.2025.104573","url":null,"abstract":"<div><div>This paper proposes an Unsupervised multi-modal domain adaptation approach for semantic segmentation of visible and thermal images. The method addresses the issue of data scarcity by transferring knowledge from existing semantic segmentation networks, thereby helping to avoid the high costs associated with data labeling. We take into account changes in temperature and light to reduce the intra-domain gap between visible and thermal images captured during the day and night. Additionally, we narrow the inter-domain gap between visible and thermal images using a self-distillation loss. Our approach allows for high-quality semantic segmentation without the need for annotations, even under challenging conditions such as nighttime and adverse weather. Experiments conducted on both visible and thermal benchmarks demonstrate the effectiveness of our method, quantitatively and qualitatively.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104573"},"PeriodicalIF":3.5,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A modular augmented reality framework for real-time clinical data visualization and interaction
Pub Date: 2025-12-02 | DOI: 10.1016/j.cviu.2025.104594
Lucia Cascone, Lucia Cimmino, Michele Nappi, Chiara Pero
This paper presents a modular augmented reality (AR) framework designed to support healthcare professionals in the real-time visualization of, and interaction with, clinical data. The system integrates biometric patient identification, large language models (LLMs) for multimodal clinical data structuring, and ontology-driven AR overlays for anatomy-aware spatial projection. Unlike conventional systems, the framework enables immersive, context-aware visualization that improves both the accessibility and the interpretability of medical information. The architecture is fully modular and mobile-compatible, allowing independent refinement of its core components. Patient identification is performed through facial recognition, while clinical documents are processed by a vision-language pipeline that standardizes heterogeneous records into structured data. Body-tracking technology anchors these parameters to the corresponding anatomical regions, supporting intuitive and dynamic interaction during consultations. The framework has been validated through a diabetology case study and a usability assessment with five clinicians, achieving a System Usability Scale (SUS) score of 73.0, which indicates good usability. Experimental results confirm the accuracy of biometric identification (97.1%). The LLM-based pipeline achieved an exact-match accuracy of 98.0% for diagnosis extraction and 86.0% for treatment extraction from unstructured clinical images, confirming its reliability in structuring heterogeneous medical content. The system is released as open source to encourage reproducibility and collaborative development. Overall, this work contributes a flexible, clinician-oriented AR platform that combines biometric recognition, multimodal data processing, and interactive visualization to advance next-generation digital healthcare applications.
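The extraction results above are reported as exact-match accuracy. Below is a minimal sketch of that metric, assuming string-valued fields and a simple case/whitespace normalization (details not specified in the abstract).

```python
def exact_match_accuracy(predicted, reference):
    """Exact-match accuracy for structured extraction (e.g., diagnosis or
    treatment fields pulled from clinical documents): a prediction counts only
    if it equals the reference after light normalization. The normalization
    rules here are assumptions, not the paper's protocol."""
    def norm(s):
        return " ".join(s.lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predicted, reference))
    return hits / len(reference) if reference else 0.0

# e.g. exact_match_accuracy(["Type 2 diabetes"], ["type 2 diabetes"]) -> 1.0
```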
{"title":"A modular augmented reality framework for real-time clinical data visualization and interaction","authors":"Lucia Cascone , Lucia Cimmino , Michele Nappi , Chiara Pero","doi":"10.1016/j.cviu.2025.104594","DOIUrl":"10.1016/j.cviu.2025.104594","url":null,"abstract":"<div><div>This paper presents a modular augmented reality (AR) framework designed to support healthcare professionals in the real-time visualization and interaction with clinical data. The system integrates biometric patient identification, large language models (LLMs) for multimodal clinical data structuring, and ontology-driven AR overlays for anatomy-aware spatial projection. Unlike conventional systems, the framework enables immersive, context-aware visualization that improves both the accessibility and interpretability of medical information. The architecture is fully modular and mobile-compatible, allowing independent refinement of its core components. Patient identification is performed through facial recognition, while clinical documents are processed by a vision-language pipeline that standardizes heterogeneous records into structured data. Body-tracking technology anchors these parameters to the corresponding anatomical regions, supporting intuitive and dynamic interaction during consultations. The framework has been validated through a diabetology case study and a usability assessment with five clinicians, achieving a System Usability Scale (SUS) score of 73.0, which indicates good usability. Experimental results confirm the accuracy of biometric identification (97.1%). The LLM-based pipeline achieved an exact match accuracy of 98.0% for diagnosis extraction and 86.0% for treatment extraction from unstructured clinical images, confirming its reliability in structuring heterogeneous medical content. The system is released as open source to encourage reproducibility and collaborative development. Overall, this work contributes a flexible, clinician-oriented AR platform that combines biometric recognition, multimodal data processing, and interactive visualization to advance next-generation digital healthcare applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104594"},"PeriodicalIF":3.5,"publicationDate":"2025-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Open-vocabulary object detection for high-resolution remote sensing images
Pub Date: 2025-12-01 | DOI: 10.1016/j.cviu.2025.104566
HuaDong Li
In high-resolution remote sensing interpretation, object detection is evolving from the closed-set to the open-set setting, i.e., traditional detection models are being generalized to detect objects described by an open vocabulary. The rapid development of vision-language pre-training in recent years has made research on open-vocabulary detection (OVD) feasible, and OVD is also considered a critical step in the transition from weak to strong artificial intelligence. However, limited by the scarcity of large-scale vision-language paired datasets, research on open-vocabulary detection for high-resolution remote sensing images (RS-OVD) lags significantly behind that for natural images. Additionally, the large scale variability of remote-sensing objects poses further challenges for open-vocabulary object detection. To address these challenges, we disentangle the generalization process into an object-level task-transformation problem and a semantic-expansion problem, and we propose a Cascade Knowledge Distillation model that addresses these problems stage by stage. We evaluate our method on the DIOR and NWPU VHR-10 datasets. The experimental results demonstrate that the proposed method effectively generalizes the object detector to unknown categories.
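A common recipe behind open-vocabulary detection is to distill detector region features toward a vision-language teacher and then classify regions by similarity to text embeddings. The sketch below shows that general recipe with assumed shapes and a hypothetical cosine-distillation term; it is not the paper's cascade design.

```python
import torch
import torch.nn.functional as F

def ovd_distill_and_classify(region_feats, teacher_region_feats, text_embeds, tau=0.01):
    """Open-vocabulary detection sketch: (1) distill detector region features
    toward a vision-language teacher's region embeddings, (2) score regions by
    cosine similarity to open-vocabulary text embeddings.
    Shapes: region_feats (N, D), teacher_region_feats (N, D), text_embeds (K, D).
    Illustrative only."""
    region_feats = F.normalize(region_feats, dim=-1)
    teacher = F.normalize(teacher_region_feats.detach(), dim=-1)
    distill_loss = (1 - (region_feats * teacher).sum(-1)).mean()  # cosine distillation

    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_feats @ text_embeds.t() / tau   # (N, K) open-vocabulary class scores
    return distill_loss, logits
```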
{"title":"Open-vocabulary object detection for high-resolution remote sensing images","authors":"HuaDong Li","doi":"10.1016/j.cviu.2025.104566","DOIUrl":"10.1016/j.cviu.2025.104566","url":null,"abstract":"<div><div>In high-resolution remote sensing interpretation, object detection is evolving from closed-set to open-set, i.e., generalizing traditional detection models to detect objects described by open-vocabulary. The rapid development of vision-language pre-training in recent years has made research on open-vocabulary detection (OVD) feasible, which is also considered a critical step in the transition from weak to strong artificial intelligence. However, limited by the scarcity of large-scale vision-language paired datasets, research on open-vocabulary detection for high-resolution remote sensing images (RS-OVD) significantly lags behind that of natural images. Additionally, the high-scale variability of remote-sensing objects poses more significant challenges for open-vocabulary object detection. To address these challenges, we innovatively disentangle the generalizing process into an object-level task transformation problem and a semantic expansion problem. Furthermore, we propose a Cascade Knowledge Distillation model addressing the problems stage by stage. We evaluate our method on the DIOR and NWPU VHR-10 datasets. The experimental results demonstrate that the proposed method effectively generalizes the object detector to unknown categories.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104566"},"PeriodicalIF":3.5,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human-in-the-loop adaptation in group activity feature learning for team sports video retrieval
Pub Date: 2025-11-29 | DOI: 10.1016/j.cviu.2025.104577
Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita
This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method first pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes with supervised learning. An interactive fine-tuning process then updates the GAF space so that a user can better retrieve videos similar to the query videos they provide. During this fine-tuning, our data-efficient video selection process presents the user with several videos selected from a video database, which the user manually labels as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space through contrastive learning, so that positive videos move closer to, and negative videos farther away from, the query videos. Comprehensive experimental results on two team sports datasets validate that our method significantly improves retrieval performance. Ablation studies also demonstrate that several components of our human-in-the-loop adaptation contribute to this improvement. Code: https://github.com/chihina/GAFL-FINE-CVIU.
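The contrastive update described above (positives pulled toward, negatives pushed away from, the query) can be written as an InfoNCE-style objective. Below is a minimal PyTorch sketch under assumed embedding shapes; the function name and temperature are illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F

def interactive_contrastive_loss(query, positives, negatives, tau=0.1):
    """InfoNCE-style objective for fine-tuning a feature space from user
    feedback: user-labeled positive videos are pulled toward the query
    embedding, negatives are pushed away. Sketch only.
    Shapes: query (D,), positives (P, D), negatives (N, D)."""
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    pos_sim = (pos @ q) / tau                      # (P,) similarity to each positive
    neg_sim = (neg @ q) / tau                      # (N,) similarity to each negative
    # Each positive is contrasted against all negatives
    logits = torch.cat([pos_sim.unsqueeze(1),
                        neg_sim.expand(len(pos_sim), -1)], dim=1)  # (P, 1+N)
    labels = torch.zeros(len(pos_sim), dtype=torch.long)           # positive at index 0
    return F.cross_entropy(logits, labels)
```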
{"title":"Human-in-the-loop adaptation in group activity feature learning for team sports video retrieval","authors":"Chihiro Nakatani , Hiroaki Kawashima , Norimichi Ukita","doi":"10.1016/j.cviu.2025.104577","DOIUrl":"10.1016/j.cviu.2025.104577","url":null,"abstract":"<div><div>This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: <span><span>https://github.com/chihina/GAFL-FINE-CVIU</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104577"},"PeriodicalIF":3.5,"publicationDate":"2025-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pay more attention to dark regions for faster shadow detection
Pub Date: 2025-11-27 | DOI: 10.1016/j.cviu.2025.104589
Xian-Tao Wu, Xiao-Diao Chen, Hongyu Chen, Wen Wu, Weiyin Ma, Haichuan Song
Deep learning-based shadow detection methods primarily focus on achieving higher accuracy, while often overlooking the importance of inference efficiency for downstream applications. This work reduces the number of patches processed during the feed-forward process and proposes a faster framework for shadow detection (namely FasterSD) based on the vision transformer. We found that most bright regions converge to a stable state even at early stages of the feed-forward process, revealing massive computational redundancy. From this observation, we introduce a token pausing strategy that locates these simple patches and pauses the refinement of their feature representations (i.e., tokens), allowing most of the computational resources to be devoted to the remaining challenging patches. Specifically, we propose to use predicted posterior entropy as a proxy for prediction correctness, and design a random pausing scheme so that the model meets flexible runtime requirements by directly adjusting the pausing configuration without repeated training. Extensive experiments on three shadow detection benchmarks (i.e., SBU, ISTD, and UCF) demonstrate that our FasterSD runs 12× faster than the state-of-the-art shadow detector with comparable performance. The code will be available at https://github.com/wuwen1994/FasterSD.
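The posterior-entropy criterion amounts to freezing tokens whose intermediate shadow/non-shadow prediction is already confident. Below is a minimal PyTorch sketch; the entropy threshold and the binary auxiliary logits are assumptions, not the paper's configuration.

```python
import torch

def pause_tokens(token_logits, entropy_thresh=0.3):
    """Entropy-based token pausing sketch: tokens with a confident intermediate
    prediction (low posterior entropy) are frozen, and only the remaining
    'hard' tokens are passed to later transformer blocks.
    token_logits: (B, N, 2) intermediate shadow/non-shadow logits."""
    p = token_logits.softmax(dim=-1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(-1)   # (B, N) posterior entropy
    keep = entropy > entropy_thresh                    # True = still uncertain, keep computing
    return keep  # boolean mask used to gather the active tokens for the next block
```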
{"title":"Pay more attention to dark regions for faster shadow detection","authors":"Xian-Tao Wu , Xiao-Diao Chen , Hongyu Chen , Wen Wu , Weiyin Ma , Haichuan Song","doi":"10.1016/j.cviu.2025.104589","DOIUrl":"10.1016/j.cviu.2025.104589","url":null,"abstract":"<div><div>Deep learning-based shadow detection methods primarily focus on achieving higher accuracy, while often overlooking the importance of inference efficiency for downstream applications. This work attempts to reduce the number of processed patches during the feed-forward process and proposes a faster framework for shadow detection (namely FasterSD) based on vision transformer. We found that most of bright regions can converge to a stable status even at early stages of the feed-forward process, revealing massive computational redundancy. From this observation, we introduce a token pausing strategy to locate these simple patches and pause to refine their feature representations (<em>i.e.</em>, tokens), enabling us to use most of computational resources to the remaining challenging patches. Specifically, we propose to use predicted posterior entropy as a proxy for prediction correctness, and design a random pausing scheme to ensure that the model meets flexible runtime requirements by directly adjusting the pausing configuration without repeated training. Extensive experiments on three shadow detection benchmarks (<em>i.e.</em>, SBU, ISTD, and UCF) demonstrate that our FasterSD can run 12<span><math><mo>×</mo></math></span> faster than the state-of-the-art shadow detector with a comparable performance. The code will be available at <span><span>https://github.com/wuwen1994/FasterSD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104589"},"PeriodicalIF":3.5,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GL2T-Diff: Medical image translation via spatial-frequency fusion diffusion models
Pub Date: 2025-11-27 | DOI: 10.1016/j.cviu.2025.104586
Dong Sui, Nanting Song, Xiao Tian, Han Zhou, Yacong Li, Maozu Guo, Kuanquan Wang, Gongning Luo
Diffusion Probabilistic Models (DPMs) are effective in medical image translation (MIT), but they tend to lose high-frequency details during the noise addition process, making it challenging to recover these details during the denoising process. This hinders the model's ability to accurately preserve anatomical details during MIT tasks, which may ultimately affect the accuracy of diagnostic outcomes. To address this issue, we propose a diffusion model (GL2T-Diff) based on convolutional channel and Laplacian frequency attention mechanisms, which is designed to enhance MIT tasks by effectively preserving critical image features. We introduce two novel modules: the Global Channel Correlation Attention Module (GC2A Module) and the Laplacian Frequency Attention Module (LFA Module). The GC2A Module enhances the model's ability to capture global dependencies between channels, while the LFA Module effectively retains high-frequency components, which are crucial for preserving anatomical structures. To leverage the complementary strengths of both the GC2A Module and the LFA Module, we propose the Laplacian Convolutional Attention with Phase-Amplitude Fusion (FusLCA), which facilitates effective integration of spatial and frequency domain features. Experimental results show that GL2T-Diff outperforms state-of-the-art (SOTA) methods, including those based on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and other DPMs, across the BraTS-2021/2024, IXI, and Pelvic datasets. The code is available at https://github.com/puzzlesong8277/GL2T-Diff.
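As a rough illustration of Laplacian-based frequency attention, the sketch below extracts high-frequency content with a fixed Laplacian kernel and uses it to gate the feature map. The module name, gating form, and shapes are assumptions, not the paper's LFA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianFrequencyGate(nn.Module):
    """Sketch of Laplacian-based high-frequency emphasis: a fixed Laplacian
    kernel extracts edge/detail content, which gates the feature map so that
    high-frequency anatomy is re-emphasized. Illustrative only."""

    def __init__(self, channels):
        super().__init__()
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        # One depthwise Laplacian filter per channel
        self.register_buffer("kernel", lap.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                        # x: (B, C, H, W)
        hf = F.conv2d(x, self.kernel, padding=1, groups=x.shape[1])  # per-channel Laplacian
        gate = torch.sigmoid(self.proj(hf))      # attention over high-frequency detail
        return x + x * gate                      # re-inject emphasized details
```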
{"title":"GL2T-Diff: Medical image translation via spatial-frequency fusion diffusion models","authors":"Dong Sui , Nanting Song , Xiao Tian , Han Zhou , Yacong Li , Maozu Guo , Kuanquan Wang , Gongning Luo","doi":"10.1016/j.cviu.2025.104586","DOIUrl":"10.1016/j.cviu.2025.104586","url":null,"abstract":"<div><div>Diffusion Probabilistic Models (DPMs) are effective in medical image translation (MIT), but they tend to lose high-frequency details during the noise addition process, making it challenging to recover these details during the denoising process. This hinders the model’s ability to accurately preserve anatomical details during MIT tasks, which may ultimately affect the accuracy of diagnostic outcomes. To address this issue, we propose a diffusion model (<span><math><mrow><msup><mrow><mi>GL</mi></mrow><mrow><mn>2</mn></mrow></msup><mi>T</mi></mrow></math></span>-Diff) based on convolutional channel and Laplacian frequency attention mechanisms, which is designed to enhance MIT tasks by effectively preserving critical image features. We introduce two novel modules: the Global Channel Correlation Attention Module (<span><math><mrow><msup><mrow><mi>GC</mi></mrow><mrow><mn>2</mn></mrow></msup><mi>A</mi></mrow></math></span> Module) and the Laplacian Frequency Attention Module (LFA Module). The <span><math><mrow><msup><mrow><mi>GC</mi></mrow><mrow><mn>2</mn></mrow></msup><mi>A</mi></mrow></math></span> Module enhances the model’s ability to capture global dependencies between channels, while the LFA Module effectively retains high-frequency components, which are crucial for preserving anatomical structures. To leverage the complementary strengths of both <span><math><mrow><msup><mrow><mi>GC</mi></mrow><mrow><mn>2</mn></mrow></msup><mi>A</mi></mrow></math></span> Module and LFA Module, we propose the Laplacian Convolutional Attention with Phase-Amplitude Fusion (FusLCA), which facilitates effective integration of spatial and frequency domain features. Experimental results show that <span><math><mrow><msup><mrow><mi>GL</mi></mrow><mrow><mn>2</mn></mrow></msup><mi>T</mi></mrow></math></span>-Diff outperforms state-of-the-art (SOTA) methods, including those based on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and other DPMs, across the BraTS-2021/2024, IXI, and Pelvic datasets. The code is available at <span><span>https://github.com/puzzlesong8277/GL2T-Diff</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104586"},"PeriodicalIF":3.5,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatio-temporal transformers for action unit classification with event cameras
Pub Date: 2025-11-26 | DOI: 10.1016/j.cviu.2025.104578
Luca Cultrera, Federico Becattini, Lorenzo Berlincioni, Claudio Ferrari, Alberto Del Bimbo
Facial analysis plays a vital role in assistive technologies aimed at improving human–computer interaction, emotional well-being, and non-verbal communication monitoring. For more fine-grained tasks, however, standard sensors might not be up to the task: their latency makes it impossible to record and detect the micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been gaining increasing interest as a possible solution to this and similar high-frame-rate tasks. In this paper, we propose a novel spatio-temporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered a major cause of the existing gap between the maturity of RGB and neuromorphic vision models. Gathering data is harder in the event domain because it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of both RGB videos and event streams. The dataset is annotated at the video level with facial Action Units and also contains streams collected with a variety of possible applications in mind, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization allows effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision, bridging the domain gap by representing face shapes in a 3D space. This makes our model suitable for real-world assistive scenarios, including privacy-preserving wearable systems and responsive social interaction monitoring. Our proposed model outperforms baseline methods by capturing the spatial and temporal information crucial for recognizing subtle facial micro-expressions.
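Locality Self-Attention, as described in the small-dataset Vision Transformer literature, replaces the fixed attention scale with a learnable temperature and masks the diagonal (self-token) similarities. The PyTorch sketch below re-implements that mechanism from the published description; it is not the authors' code, and the head count and initialization are assumptions.

```python
import torch
import torch.nn as nn

class LocalitySelfAttention(nn.Module):
    """Sketch of Locality Self-Attention (LSA): self-attention with a learnable
    softmax temperature and masked self-token similarities, which sharpens
    attention on neighboring tokens. Illustrative re-implementation."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # learnable temperature instead of the fixed 1/sqrt(d) scaling
        self.temperature = nn.Parameter(torch.tensor((dim // heads) ** -0.5))

    def forward(self, x):                                  # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, D // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float("-inf"))       # suppress self-token attention
        out = attn.softmax(dim=-1) @ v                     # (B, H, N, d)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))
```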
{"title":"Spatio-temporal transformers for action unit classification with event cameras","authors":"Luca Cultrera , Federico Becattini , Lorenzo Berlincioni , Claudio Ferrari , Alberto Del Bimbo","doi":"10.1016/j.cviu.2025.104578","DOIUrl":"10.1016/j.cviu.2025.104578","url":null,"abstract":"<div><div>Facial analysis plays a vital role in assistive technologies aimed at improving human–computer interaction, emotional well-being, and non-verbal communication monitoring. For more fine-grained tasks, however, standard sensors might not be up to the task, due to their latency, making it impossible to record and detect micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been increasingly gaining interest as a possible solution to this and similar high-frame rate tasks. In this paper we propose a novel spatio-temporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered a major cause of an existing gap between the maturity of RGB and neuromorphic vision models. In fact, gathering data is harder in the event domain since it cannot be crawled from the web and labeling frames should take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of both RGB videos and event streams. The dataset is annotated at a video level with facial Action Units and also contains streams collected with a variety of possible applications, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision bridging the domain gap by representing face shapes in a 3D space. This makes our model suitable for real-world assistive scenarios, including privacy-preserving wearable systems and responsive social interaction monitoring. Our proposed model outperforms baseline methods by capturing spatial and temporal information, crucial for recognizing subtle facial micro-expressions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104578"},"PeriodicalIF":3.5,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What2Keep: A communication-efficient collaborative perception framework for 3D detection via keeping valuable information
Pub Date: 2025-11-26 | DOI: 10.1016/j.cviu.2025.104572
Hongkun Zhang, Yan Wu, Zhengbin Zhang
Collaborative perception has attracted significant attention in autonomous driving, as the ability to share information among Connected Autonomous Vehicles (CAVs) substantially enhances perception performance. However, collaborative perception faces critical challenges, among which limited communication bandwidth remains a fundamental bottleneck due to inherent constraints of current communication technologies. Bandwidth limitations can severely degrade the transmitted information, leading to a sharp decline in perception performance. To address this issue, we propose What To Keep (What2Keep), a collaborative perception framework that dynamically adapts to communication bandwidth fluctuations. Our method aims to establish a consensus between vehicles, prioritizing the transmission of the intermediate features that are most critical to the ego vehicle. The proposed framework offers two key advantages: (1) the consensus-based feature selection mechanism effectively incorporates different collaborative patterns as prior knowledge to help vehicles preserve the most valuable features, improving communication efficiency and enhancing robustness against communication degradation; and (2) What2Keep employs a cross-vehicle fusion strategy that effectively aggregates cooperative perception information while remaining robust to varying communication volume. Extensive experiments demonstrate the superior performance of our method on the OPV2V and V2XSet benchmarks, where it achieves state-of-the-art AP scores of 83.57% and 77.78%, respectively, while maintaining an approximately 20% relative improvement under severe bandwidth constraints (2^14 B). Our qualitative experiments explain the working mechanism of What2Keep. Code will be available at https://github.com/CHAMELENON/What2Keep.
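At its core, bandwidth-adaptive collaboration means transmitting only the feature cells judged most valuable under a budget. The sketch below shows a simple top-k selection over an importance map; the importance map, budget ratio, and function name are assumptions, not What2Keep's consensus mechanism.

```python
import torch

def select_features_for_transmission(feat_map, importance, budget_ratio=0.1):
    """Bandwidth-aware feature selection sketch: keep only the spatial cells
    with the highest importance scores (e.g., a learned confidence map) and
    transmit their values plus indices. Illustrative only.
    feat_map: (C, H, W), importance: (H, W)."""
    C, H, W = feat_map.shape
    k = max(1, int(budget_ratio * H * W))        # number of cells the budget allows
    topk_idx = importance.flatten().topk(k).indices   # (k,) most valuable cells
    values = feat_map.flatten(1)[:, topk_idx]         # (C, k) features to send
    return topk_idx, values                            # receiver scatters them back onto the BEV grid
```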
{"title":"What2Keep: A communication-efficient collaborative perception framework for 3D detection via keeping valuable information","authors":"Hongkun Zhang, Yan Wu, Zhengbin Zhang","doi":"10.1016/j.cviu.2025.104572","DOIUrl":"10.1016/j.cviu.2025.104572","url":null,"abstract":"<div><div>Collaborative perception has aroused significant attention in autonomous driving, as the ability to share information among Connected Autonomous Vehicles (CAVs) substantially enhances perception performance. However, collaborative perception faces critical challenges, among which limited communication bandwidth remains a fundamental bottleneck due to inherent constraints in current communication technologies. Bandwidth limitations can severely degrade transmitted information, leading to a sharp decline in perception performance. To address this issue, we propose What To Keep (What2Keep), a collaborative perception framework that dynamically adapts to communication bandwidth fluctuations. Our method aims to establish a consensus between vehicles, prioritizing the transmission of intermediate features that are most critical to the ego vehicle. The proposed framework offers two key advantages: (1) the consensus-based feature selection mechanism effectively incorporates different collaborative patterns as prior knowledge to help vehicles preserves the most valuable features, improving communication efficiency and enhancing model robustness against communication degradation; and (2) What2Keep employs a cross-vehicle fusion strategy that effectively aggregates cooperative perception information while exhibiting robustness against varying communication volume. Extensive experiments have demonstrated the superior performance of our method in OPV2V and V2XSet benchmarks, achieving state-of-the-art [email protected] scores of 83.57% and 77.78% respectively while maintaining approximately 20% relative improvement under severe bandwidth constraints (<span><math><mrow><msup><mrow><mn>2</mn></mrow><mrow><mn>14</mn></mrow></msup><mtext>B</mtext></mrow></math></span>). Our qualitative experiments successfully explain the working mechanism of What2Keep. Code will be available at <span><span>https://github.com/CHAMELENON/What2Keep</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104572"},"PeriodicalIF":3.5,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transformer tracking with high-low frequency attention
Pub Date: 2025-11-25 | DOI: 10.1016/j.cviu.2025.104563
Zhi Chen, Zhen Yu
Transformer-based trackers have achieved impressive performance due to their powerful global modeling capability. However, most existing methods employ vanilla attention modules, which treat template and search regions homogeneously and overlook the distinct characteristics of different frequency features—high-frequency components capture local details critical for target identification, while low-frequency components provide global structural context. To bridge this gap, we propose a novel Transformer architecture with High-low (Hi–Lo) frequency attention for visual object tracking. Specifically, a high-frequency attention module is applied to the template region to preserve fine-grained target details. Conversely, a low-frequency attention module processes the search region to efficiently capture global dependencies with reduced computational cost. Furthermore, we introduce a Global–Local Dual Interaction (GLDI) module to establish reciprocal feature enhancement between the template and search feature maps, effectively integrating multi-frequency information. Extensive experiments on six challenging benchmarks (LaSOT, GOT-10k, TrackingNet, UAV123, OTB100, and NFS) demonstrate that our method, named HiLoTT, achieves state-of-the-art performance while maintaining a real-time speed of 45 frames per second.
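For intuition, the low-frequency branch can be approximated as attention whose keys and values come from an average-pooled (low-pass) version of the search-region features, while a separate high-frequency branch would operate on the full-resolution template. The PyTorch sketch below covers the low-frequency path only, with assumed shapes and pooling size; it is not the HiLoTT implementation.

```python
import torch
import torch.nn as nn

class LoFreqAttention(nn.Module):
    """Sketch of a low-frequency attention branch: queries attend to keys/values
    taken from a spatially average-pooled feature map, so only the cheap,
    global (low-frequency) structure of the search region is modeled."""

    def __init__(self, dim, heads=8, pool=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AvgPool2d(pool)      # low-pass: downsample keys/values

    def forward(self, x, h, w):             # x: (B, h*w, D) search-region tokens
        B, N, D = x.shape
        kv = x.transpose(1, 2).reshape(B, D, h, w)
        kv = self.pool(kv).flatten(2).transpose(1, 2)      # (B, h*w/pool^2, D)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out

# toy usage: one 16x16 search feature map with 256-dim tokens
x = torch.randn(1, 16 * 16, 256)
y = LoFreqAttention(256)(x, 16, 16)   # (1, 256 tokens, 256 dims)
```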
{"title":"Transformer tracking with high-low frequency attention","authors":"Zhi Chen , Zhen Yu","doi":"10.1016/j.cviu.2025.104563","DOIUrl":"10.1016/j.cviu.2025.104563","url":null,"abstract":"<div><div>Transformer-based trackers have achieved impressive performance due to their powerful global modeling capability. However, most existing methods employ vanilla attention modules, which treat template and search regions homogeneously and overlook the distinct characteristics of different frequency features—high-frequency components capture local details critical for target identification, while low-frequency components provide global structural context. To bridge this gap, we propose a novel Transformer architecture with High-low (Hi–Lo) frequency attention for visual object tracking. Specifically, a high-frequency attention module is applied to the template region to preserve fine-grained target details. Conversely, a low-frequency attention module processes the search region to efficiently capture global dependencies with reduced computational cost. Furthermore, we introduce a Global–Local Dual Interaction (GLDI) module to establish reciprocal feature enhancement between the template and search feature maps, effectively integrating multi-frequency information. Extensive experiments on six challenging benchmarks (LaSOT, GOT-10k, TrackingNet, UAV123, OTB100, and NFS) demonstrate that our method, named HiLoTT, achieves state-of-the-art performance while maintaining a real-time speed of 45 frames per second.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104563"},"PeriodicalIF":3.5,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}