
Computer Vision and Image Understanding: Latest Articles

A modular augmented reality framework for real-time clinical data visualization and interaction
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-02 | DOI: 10.1016/j.cviu.2025.104594
Lucia Cascone, Lucia Cimmino, Michele Nappi, Chiara Pero
This paper presents a modular augmented reality (AR) framework designed to support healthcare professionals in the real-time visualization and interaction with clinical data. The system integrates biometric patient identification, large language models (LLMs) for multimodal clinical data structuring, and ontology-driven AR overlays for anatomy-aware spatial projection. Unlike conventional systems, the framework enables immersive, context-aware visualization that improves both the accessibility and interpretability of medical information. The architecture is fully modular and mobile-compatible, allowing independent refinement of its core components. Patient identification is performed through facial recognition, while clinical documents are processed by a vision-language pipeline that standardizes heterogeneous records into structured data. Body-tracking technology anchors these parameters to the corresponding anatomical regions, supporting intuitive and dynamic interaction during consultations. The framework has been validated through a diabetology case study and a usability assessment with five clinicians, achieving a System Usability Scale (SUS) score of 73.0, which indicates good usability. Experimental results confirm the accuracy of biometric identification (97.1%). The LLM-based pipeline achieved an exact match accuracy of 98.0% for diagnosis extraction and 86.0% for treatment extraction from unstructured clinical images, confirming its reliability in structuring heterogeneous medical content. The system is released as open source to encourage reproducibility and collaborative development. Overall, this work contributes a flexible, clinician-oriented AR platform that combines biometric recognition, multimodal data processing, and interactive visualization to advance next-generation digital healthcare applications.
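As a rough illustration of the modular design described in this abstract, the sketch below wires the three stages (face identification, LLM-based record structuring, anatomy-anchored AR rendering) behind small interchangeable interfaces; all class and method names are hypothetical and are not taken from the released system.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class StructuredRecord:
    patient_id: str
    diagnoses: list[str]
    treatments: list[str]


class FaceIdentifier(Protocol):
    def identify(self, frame) -> str: ...                 # returns a patient_id


class RecordStructurer(Protocol):
    def structure(self, patient_id: str, document_image) -> StructuredRecord: ...


class AROverlayRenderer(Protocol):
    def render(self, record: StructuredRecord, body_pose) -> None: ...


class ClinicalARPipeline:
    """Glue code only: identification -> LLM structuring -> anatomy-anchored overlay."""

    def __init__(self, identifier: FaceIdentifier,
                 structurer: RecordStructurer,
                 renderer: AROverlayRenderer):
        self.identifier = identifier
        self.structurer = structurer
        self.renderer = renderer

    def step(self, camera_frame, document_image, body_pose) -> StructuredRecord:
        patient_id = self.identifier.identify(camera_frame)
        record = self.structurer.structure(patient_id, document_image)
        self.renderer.render(record, body_pose)            # overlay anchored to the tracked body
        return record
```

Because each stage sits behind its own interface, any one of them can be swapped or refined without touching the others, which is the modularity claim the abstract makes.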
Citations: 0
Open-vocabulary object detection for high-resolution remote sensing images
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-01 | DOI: 10.1016/j.cviu.2025.104566
HuaDong Li
In high-resolution remote sensing interpretation, object detection is evolving from closed-set to open-set, i.e., generalizing traditional detection models to detect objects described by an open vocabulary. The rapid development of vision-language pre-training in recent years has made research on open-vocabulary detection (OVD) feasible, which is also considered a critical step in the transition from weak to strong artificial intelligence. However, limited by the scarcity of large-scale vision-language paired datasets, research on open-vocabulary detection for high-resolution remote sensing images (RS-OVD) significantly lags behind that for natural images. Additionally, the high-scale variability of remote-sensing objects poses more significant challenges for open-vocabulary object detection. To address these challenges, we innovatively disentangle the generalizing process into an object-level task transformation problem and a semantic expansion problem. Furthermore, we propose a Cascade Knowledge Distillation model that addresses these problems stage by stage. We evaluate our method on the DIOR and NWPU VHR-10 datasets. The experimental results demonstrate that the proposed method effectively generalizes the object detector to unknown categories.
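The sketch below is not the paper's Cascade Knowledge Distillation; it only illustrates the common open-vocabulary recipe the abstract refers to, in which detector region embeddings are scored against text embeddings of arbitrary class names and distilled toward a vision-language teacher. Shapes, the temperature, and the toy tensors are assumptions.

```python
import torch
import torch.nn.functional as F


def open_vocab_logits(region_emb: torch.Tensor,    # (N, D) detector region features
                      text_emb: torch.Tensor,      # (C, D) encoded class-name prompts
                      tau: float = 0.01) -> torch.Tensor:
    # Cosine-similarity scores: unseen classes only require new text prompts.
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return region_emb @ text_emb.t() / tau         # (N, C) logits


def distillation_loss(student_region_emb: torch.Tensor,
                      teacher_region_emb: torch.Tensor) -> torch.Tensor:
    """Pull the detector's region embeddings toward a vision-language teacher's embeddings."""
    return 1.0 - F.cosine_similarity(student_region_emb, teacher_region_emb, dim=-1).mean()


# toy usage with random tensors
student = torch.randn(8, 512)
teacher = torch.randn(8, 512)
text = torch.randn(20, 512)                        # 20 open-vocabulary class names, already encoded
loss = distillation_loss(student, teacher)
scores = open_vocab_logits(student, text).softmax(dim=-1)
```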
Citations: 0
Human-in-the-loop adaptation in group activity feature learning for team sports video retrieval
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-29 | DOI: 10.1016/j.cviu.2025.104577
Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita
This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: https://github.com/chihina/GAFL-FINE-CVIU.
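A minimal sketch of the fine-tuning signal described above, assuming an InfoNCE-style objective: videos the user labels as positive are pulled toward the query in the group-activity feature (GAF) space, and labelled negatives are pushed away. The exact loss, temperature, and feature dimension used in the paper may differ.

```python
import torch
import torch.nn.functional as F


def user_feedback_loss(query: torch.Tensor,        # (D,) query-video feature
                       positives: torch.Tensor,    # (P, D) user-labelled positive videos
                       negatives: torch.Tensor,    # (N, D) user-labelled negative videos
                       tau: float = 0.1) -> torch.Tensor:
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    pos_sim = pos @ q / tau                         # (P,) similarity of each positive to the query
    neg_sim = neg @ q / tau                         # (N,)
    # contrast every positive against all negatives
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim.expand(len(pos_sim), -1)], dim=1)
    labels = torch.zeros(len(pos_sim), dtype=torch.long)   # index 0 = the positive
    return F.cross_entropy(logits, labels)


query = torch.randn(256, requires_grad=True)        # stands in for the GAF encoder output
loss = user_feedback_loss(query, torch.randn(3, 256), torch.randn(5, 256))
loss.backward()                                      # a real loop would update the encoder here
```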
Citations: 0
Pay more attention to dark regions for faster shadow detection
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-27 | DOI: 10.1016/j.cviu.2025.104589
Xian-Tao Wu, Xiao-Diao Chen, Hongyu Chen, Wen Wu, Weiyin Ma, Haichuan Song
Deep learning-based shadow detection methods primarily focus on achieving higher accuracy, while often overlooking the importance of inference efficiency for downstream applications. This work attempts to reduce the number of processed patches during the feed-forward process and proposes a faster framework for shadow detection (namely FasterSD) based on the vision transformer. We found that most bright regions can converge to a stable state even at early stages of the feed-forward process, revealing massive computational redundancy. From this observation, we introduce a token pausing strategy to locate these simple patches and pause the refinement of their feature representations (i.e., tokens), enabling us to devote most of the computational resources to the remaining challenging patches. Specifically, we propose to use predicted posterior entropy as a proxy for prediction correctness, and design a random pausing scheme to ensure that the model meets flexible runtime requirements by directly adjusting the pausing configuration without repeated training. Extensive experiments on three shadow detection benchmarks (i.e., SBU, ISTD, and UCF) demonstrate that our FasterSD can run 12× faster than the state-of-the-art shadow detector with comparable performance. The code will be available at https://github.com/wuwen1994/FasterSD.
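A minimal sketch of the entropy-based pausing idea, not the released FasterSD code: tokens whose provisional shadow prediction already has low binary entropy are frozen, and only the uncertain tokens continue through later blocks. The threshold value is an assumption.

```python
import torch


def active_token_mask(shadow_prob: torch.Tensor,      # (B, N) per-token shadow probability
                      threshold: float = 0.3) -> torch.Tensor:
    p = shadow_prob.clamp(1e-8, 1 - 1e-8)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())  # binary entropy per token
    return entropy > threshold                           # True = still uncertain, keep computing


B, N, D = 2, 196, 384
tokens = torch.randn(B, N, D)
mask = active_token_mask(torch.rand(B, N))
active_tokens = tokens[0][mask[0]]                       # later blocks would see only these tokens
print(f"share of tokens still active: {mask.float().mean().item():.0%}")
```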
Citations: 0
GL2T-Diff: Medical image translation via spatial-frequency fusion diffusion models
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-27 | DOI: 10.1016/j.cviu.2025.104586
Dong Sui, Nanting Song, Xiao Tian, Han Zhou, Yacong Li, Maozu Guo, Kuanquan Wang, Gongning Luo
Diffusion Probabilistic Models (DPMs) are effective in medical image translation (MIT), but they tend to lose high-frequency details during the noise addition process, making it challenging to recover these details during the denoising process. This hinders the model’s ability to accurately preserve anatomical details during MIT tasks, which may ultimately affect the accuracy of diagnostic outcomes. To address this issue, we propose a diffusion model (GL2T-Diff) based on convolutional channel and Laplacian frequency attention mechanisms, which is designed to enhance MIT tasks by effectively preserving critical image features. We introduce two novel modules: the Global Channel Correlation Attention Module (GC2A Module) and the Laplacian Frequency Attention Module (LFA Module). The GC2A Module enhances the model’s ability to capture global dependencies between channels, while the LFA Module effectively retains high-frequency components, which are crucial for preserving anatomical structures. To leverage the complementary strengths of both GC2A Module and LFA Module, we propose the Laplacian Convolutional Attention with Phase-Amplitude Fusion (FusLCA), which facilitates effective integration of spatial and frequency domain features. Experimental results show that GL2T-Diff outperforms state-of-the-art (SOTA) methods, including those based on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and other DPMs, across the BraTS-2021/2024, IXI, and Pelvic datasets. The code is available at https://github.com/puzzlesong8277/GL2T-Diff.
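A small sketch of the kind of Laplacian-style high-frequency extraction that frequency attention builds on, not the paper's LFA module itself: the residual between a feature map and its Gaussian-blurred version serves as the high-frequency band, which can then gate or re-weight features so fine anatomical detail is preserved. The gating step at the end is an assumption.

```python
import torch
import torch.nn.functional as F


def high_frequency_residual(x: torch.Tensor) -> torch.Tensor:
    """x: (B, C, H, W). Returns x minus its Gaussian-blurred version (a Laplacian-like band)."""
    b, c, h, w = x.shape
    k = torch.tensor([[1., 2., 1.],
                      [2., 4., 2.],
                      [1., 2., 1.]], device=x.device) / 16.0
    kernel = k.view(1, 1, 3, 3).repeat(c, 1, 1, 1)     # one fixed blur kernel per channel
    blurred = F.conv2d(x, kernel, padding=1, groups=c)
    return x - blurred


feat = torch.randn(1, 64, 32, 32)
hf = high_frequency_residual(feat)
gate = torch.sigmoid(hf.mean(dim=1, keepdim=True))      # simple spatial gate from HF energy
enhanced = feat * gate                                   # one plausible way to re-weight features
```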
Citations: 0
Spatio-temporal transformers for action unit classification with event cameras
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-26 | DOI: 10.1016/j.cviu.2025.104578
Luca Cultrera, Federico Becattini, Lorenzo Berlincioni, Claudio Ferrari, Alberto Del Bimbo
Facial analysis plays a vital role in assistive technologies aimed at improving human–computer interaction, emotional well-being, and non-verbal communication monitoring. For more fine-grained tasks, however, standard sensors might not be up to the task, due to their latency, making it impossible to record and detect micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been increasingly gaining interest as a possible solution to this and similar high-frame rate tasks. In this paper we propose a novel spatio-temporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered a major cause of an existing gap between the maturity of RGB and neuromorphic vision models. In fact, gathering data is harder in the event domain since it cannot be crawled from the web and labeling frames should take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of both RGB videos and event streams. The dataset is annotated at a video level with facial Action Units and also contains streams collected with a variety of possible applications, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision bridging the domain gap by representing face shapes in a 3D space. This makes our model suitable for real-world assistive scenarios, including privacy-preserving wearable systems and responsive social interaction monitoring. Our proposed model outperforms baseline methods by capturing spatial and temporal information, crucial for recognizing subtle facial micro-expressions.
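Locality Self-Attention (LSA), one of the two components the abstract names, can be sketched in a generic form: a learnable softmax temperature plus masking of each token's self-similarity sharpens attention toward other tokens. This is an illustration of the mechanism, not the paper's full spatio-temporal model; dimensions and head counts are assumptions.

```python
import torch
import torch.nn as nn


class LocalitySelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.tensor((dim // heads) ** -0.5))  # learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, N, D)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.heads, d // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each (B, H, N, d_h)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        mask = torch.eye(n, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float("-inf"))        # drop each token's self-relation
        out = attn.softmax(dim=-1) @ v                      # (B, H, N, d_h)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


tokens = torch.randn(2, 49, 128)
y = LocalitySelfAttention(128, heads=4)(tokens)             # (2, 49, 128)
```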
Citations: 0
What2Keep: A communication-efficient collaborative perception framework for 3D detection via keeping valuable information
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-26 | DOI: 10.1016/j.cviu.2025.104572
Hongkun Zhang, Yan Wu, Zhengbin Zhang
Collaborative perception has attracted significant attention in autonomous driving, as the ability to share information among Connected Autonomous Vehicles (CAVs) substantially enhances perception performance. However, collaborative perception faces critical challenges, among which limited communication bandwidth remains a fundamental bottleneck due to inherent constraints in current communication technologies. Bandwidth limitations can severely degrade transmitted information, leading to a sharp decline in perception performance. To address this issue, we propose What To Keep (What2Keep), a collaborative perception framework that dynamically adapts to communication bandwidth fluctuations. Our method aims to establish a consensus between vehicles, prioritizing the transmission of intermediate features that are most critical to the ego vehicle. The proposed framework offers two key advantages: (1) the consensus-based feature selection mechanism effectively incorporates different collaborative patterns as prior knowledge to help vehicles preserve the most valuable features, improving communication efficiency and enhancing model robustness against communication degradation; and (2) What2Keep employs a cross-vehicle fusion strategy that effectively aggregates cooperative perception information while exhibiting robustness against varying communication volume. Extensive experiments have demonstrated the superior performance of our method on the OPV2V and V2XSet benchmarks, achieving state-of-the-art [email protected] scores of 83.57% and 77.78% respectively, while maintaining approximately 20% relative improvement under severe bandwidth constraints (2^14 B). Our qualitative experiments successfully explain the working mechanism of What2Keep. Code will be available at https://github.com/CHAMELENON/What2Keep.
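A sketch of the bandwidth-adaptive selection idea, not the What2Keep implementation: each agent scores its BEV feature map with a priority map and transmits only the top-k spatial locations allowed by the current budget, which the receiver scatters back onto an empty canvas. The budget-to-k mapping and shapes are assumptions.

```python
import torch


def select_features_for_transmission(feat: torch.Tensor,      # (C, H, W) BEV features
                                     priority: torch.Tensor,  # (H, W) relevance to the ego vehicle
                                     keep_ratio: float = 0.1):
    c, h, w = feat.shape
    k = max(1, int(keep_ratio * h * w))                        # locations the bandwidth budget allows
    idx = priority.flatten().topk(k).indices                   # most valuable spatial locations
    values = feat.flatten(1)[:, idx]                           # (C, k) features actually sent
    return idx, values


def reassemble(idx: torch.Tensor, values: torch.Tensor, shape):
    c = values.shape[0]
    h, w = shape
    canvas = torch.zeros(c, h * w)
    canvas[:, idx] = values                                    # sparse feature map on the receiver side
    return canvas.view(c, h, w)


feat = torch.randn(64, 100, 100)
priority = torch.rand(100, 100)
idx, vals = select_features_for_transmission(feat, priority, keep_ratio=0.05)
received = reassemble(idx, vals, (100, 100))
```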
Citations: 0
Transformer tracking with high-low frequency attention
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-25 | DOI: 10.1016/j.cviu.2025.104563
Zhi Chen, Zhen Yu
Transformer-based trackers have achieved impressive performance due to their powerful global modeling capability. However, most existing methods employ vanilla attention modules, which treat template and search regions homogeneously and overlook the distinct characteristics of different frequency features—high-frequency components capture local details critical for target identification, while low-frequency components provide global structural context. To bridge this gap, we propose a novel Transformer architecture with High-low (Hi–Lo) frequency attention for visual object tracking. Specifically, a high-frequency attention module is applied to the template region to preserve fine-grained target details. Conversely, a low-frequency attention module processes the search region to efficiently capture global dependencies with reduced computational cost. Furthermore, we introduce a Global–Local Dual Interaction (GLDI) module to establish reciprocal feature enhancement between the template and search feature maps, effectively integrating multi-frequency information. Extensive experiments on six challenging benchmarks (LaSOT, GOT-10k, TrackingNet, UAV123, OTB100, and NFS) demonstrate that our method, named HiLoTT, achieves state-of-the-art performance while maintaining a real-time speed of 45 frames per second.
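The low-frequency half of a Hi-Lo style attention split can be sketched as attention over average-pooled keys and values, which captures global context at reduced cost; a high-frequency branch would instead attend locally at full resolution over the template. This is a generic illustration, not the HiLoTT modules themselves; the pooling factor and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowFreqAttention(nn.Module):
    def __init__(self, dim: int, pool: int = 2):
        super().__init__()
        self.pool = pool
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        b, n, d = x.shape                                   # x: (B, H*W, D)
        h, w = hw
        q = self.q(x)                                       # full-resolution queries
        x2d = x.transpose(1, 2).reshape(b, d, h, w)
        pooled = F.avg_pool2d(x2d, self.pool).flatten(2).transpose(1, 2)   # (B, n', D) coarse tokens
        k, v = self.kv(pooled).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v                                      # (B, N, D) global low-frequency context


x = torch.randn(2, 16 * 16, 96)
out = LowFreqAttention(96, pool=2)(x, (16, 16))
```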
Citations: 0
Evaluating the effect of image quantity on Gaussian Splatting: A statistical perspective
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-25 | DOI: 10.1016/j.cviu.2025.104575
Anurag Dalal, Daniel Hagen, Kjell Gunnar Robbersmyr, Kristian Muri Knausgård
3D reconstruction is now a key capability in computer vision. With the advancements in NeRFs and Gaussian Splatting, there is an increasing need to properly capture data to feed these algorithms and use them in real-world scenarios. Most publicly available datasets that can be used for Gaussian Splatting are not suitable for proper statistical analysis of the effect of reducing the number of cameras, or of uniformly placed versus randomly placed cameras. The number of cameras in the scene significantly affects the accuracy and resolution of the final 3D reconstruction. Thus, designing a proper data capture system with a certain number of cameras is crucial for 3D reconstruction. In this paper, the UnrealGaussianStat dataset is introduced, and a statistical analysis is performed on the effect that decreasing the number of viewpoints has on Gaussian Splatting. It is found that when the number of cameras is increased beyond 100, the train and test metrics saturate, and additional cameras do not have a significant impact on reconstruction quality.
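The experiment protocol described above can be sketched as a sweep over camera budgets, evaluating several random view subsets per budget and summarising the metric so a saturation plateau becomes visible. `evaluate_subset` is a placeholder for an actual train-and-render pipeline; the toy metric below exists only to make the loop runnable.

```python
import random
import statistics
from typing import Callable, Sequence


def camera_count_sweep(all_views: Sequence[str],
                       budgets: Sequence[int],
                       evaluate_subset: Callable[[Sequence[str]], float],
                       repeats: int = 3,
                       seed: int = 0):
    rng = random.Random(seed)
    results = {}
    for n in budgets:
        psnrs = []
        for _ in range(repeats):
            subset = rng.sample(list(all_views), k=min(n, len(all_views)))
            psnrs.append(evaluate_subset(subset))       # e.g. test-set PSNR after training on the subset
        results[n] = (statistics.mean(psnrs), statistics.pstdev(psnrs))
    return results


# toy stand-in metric so the sweep runs end-to-end
views = [f"cam_{i:03d}.png" for i in range(200)]
fake_eval = lambda subset: 20.0 + 10.0 * (1.0 - 1.0 / (1.0 + len(subset) / 40.0))
for n, (mean, std) in camera_count_sweep(views, [25, 50, 100, 150, 200], fake_eval).items():
    print(f"{n:4d} cameras: PSNR {mean:.2f} +/- {std:.2f}")
```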
Citations: 0
LLIE-Face: A multi-modal dataset for low-light facial image enhancement
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-24 | DOI: 10.1016/j.cviu.2025.104576
Haoyuan Sun, Dahua Gao, Pengfei He, Xiaoqian Li, Fuming Wang
Low-Light Image Enhancement (LLIE) plays a crucial role in the field of computer vision, particularly in tasks such as face recognition and surveillance systems, where clear visual information is essential. However, existing paired LLIE datasets are scarce, especially those focused on faces, which has somewhat limited the development of robust LLIE methods. Furthermore, existing LLIE methods still have performance bottlenecks under extreme low-light conditions. To address these challenges, inspired by the ability of infrared images to provide additional details and contrast information unaffected by lighting conditions, we propose a new dataset, LLIE-Face, which contains 500 pairs of low-light, infrared, and normal-light facial images. Based on this dataset, we design a Brightness and Structure Decoupling Network (BSDNet), which uses two branches to process the brightness and structural information of the image separately. The goal is to enhance the brightness while simultaneously recovering fine details. Additionally, we introduce the Cross Attention State Space Model (CASSM) module, designed to facilitate effective interaction between brightness and structural information. Finally, we fully consider the intrinsic relationship between low-light image enhancement and image fusion to achieve effective image enhancement. Using the LLIE-Face dataset, we train and evaluate both BSDNet and SOTA models, conducting comprehensive benchmarking. Experimental results demonstrate that the proposed method significantly improves image contrast, detail clarity, and visual quality under extreme low-light conditions.
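The branch interaction can be sketched as cross-attention from the brightness branch to the structure branch; note that the paper's CASSM module couples the branches through a state-space model, which is replaced here by standard multi-head cross-attention purely for illustration. Dimensions and the residual update are assumptions.

```python
import torch
import torch.nn as nn


class BrightnessStructureCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, brightness_tokens: torch.Tensor,      # (B, N, D) brightness-branch features
                structure_tokens: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(brightness_tokens)
        kv = self.norm_kv(structure_tokens)
        fused, _ = self.attn(q, kv, kv)                      # brightness attends to structure cues
        return brightness_tokens + fused                     # residual update of the brightness branch


bright = torch.randn(2, 256, 64)    # e.g. flattened 16x16 feature map with 64 channels
struct = torch.randn(2, 256, 64)    # infrared/structure branch features
out = BrightnessStructureCrossAttention(64)(bright, struct)
```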
Citations: 0