Image and Vision Computing: Latest Articles

Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi

Objective

To develop a robust model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.

Methods

The proposed system integrates data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models to estimate environmental context and detect motion activity in real time. The study took a multidisciplinary approach combining the Internet of Things (IoT) and Artificial Intelligence (AI) and was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices equipped with proximity sensors, ambient light sensors, and motion detectors across diverse indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, under different lighting, weather, and dynamic conditions to ensure diversity and real-world applicability. The resulting comprehensive dataset was used in a rigorous training and validation process to evaluate the model's accuracy in real-time environmental context estimation and motion activity detection and to ensure reliability and scalability across scenarios. Ethical considerations were adhered to throughout the project, which involved no direct interaction with human subjects.

Results

The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.

Conclusion

The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.
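
The Methods above describe estimating the environmental context by fusing proximity, ambient light, and motion readings with recursive Bayesian filtering. The paper's code is not published here, so the following is only a minimal sketch of that general idea, assuming a small discrete set of contexts and hypothetical per-sensor likelihood tables; it is not the authors' implementation.

```python
import numpy as np

# Hypothetical discrete contexts and per-sensor likelihood tables P(observation | context).
# All numbers are illustrative assumptions, not values from the paper.
CONTEXTS = ["indoor", "outdoor", "crowded_street"]

# Rows = contexts, columns = discretized observation bins for each sensor.
LIKELIHOODS = {
    "proximity":     np.array([[0.6, 0.3, 0.1],    # indoor: mostly near obstacles
                               [0.2, 0.3, 0.5],    # outdoor: mostly far
                               [0.5, 0.4, 0.1]]),  # crowded street: near/medium
    "ambient_light": np.array([[0.7, 0.2, 0.1],
                               [0.1, 0.3, 0.6],
                               [0.2, 0.4, 0.4]]),
    "motion":        np.array([[0.5, 0.4, 0.1],
                               [0.4, 0.4, 0.2],
                               [0.1, 0.3, 0.6]]),
}

def recursive_bayes_update(prior, observations):
    """One recursive Bayesian filtering step: multiply the prior by each
    sensor likelihood and renormalize to obtain the posterior over contexts."""
    posterior = prior.copy()
    for sensor, obs_bin in observations.items():
        posterior *= LIKELIHOODS[sensor][:, obs_bin]
    return posterior / posterior.sum()

belief = np.full(len(CONTEXTS), 1.0 / len(CONTEXTS))  # uniform initial belief
stream = [{"proximity": 0, "ambient_light": 0, "motion": 1},
          {"proximity": 2, "ambient_light": 2, "motion": 2}]
for obs in stream:
    belief = recursive_bayes_update(belief, obs)
    print(dict(zip(CONTEXTS, belief.round(3))))
```

In a full system this per-step update would also include a context transition model and would feed the posterior into the kernel-based fusion and graphical-model stages the abstract mentions.
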
Citations: 0
A human layout consistency framework for image-based virtual try-on
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105831
Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, which require no ground-truth image pairs, are constructed by assessing the consistency between the expected and actual human layouts. We design a dual-phase training strategy, first warming up the HLG and HLP and then training the try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by margins of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
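
The supervisory signal here is the agreement between the expected layout from HLG and the layout parsed from the try-on result by HLP. As a rough, hedged illustration of how such a consistency term could be computed, the PyTorch sketch below compares the two layout maps with a per-pixel cross-entropy; the function name and the exact loss form are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def layout_consistency_loss(expected_layout_logits, parsed_layout_labels):
    """Hypothetical consistency term between an expected human layout
    (HLG output: class logits per pixel) and the actual layout parsed
    from the try-on image (HLP output: hard labels per pixel).
    Illustrative stand-in, not the paper's exact formulation."""
    return F.cross_entropy(expected_layout_logits, parsed_layout_labels)

# Toy example: batch of 2, 8 layout classes (body parts + garment + background), 64x64 maps.
expected = torch.randn(2, 8, 64, 64, requires_grad=True)   # from the upstream generator
parsed = torch.randint(0, 8, (2, 64, 64))                  # from the downstream parser
loss = layout_consistency_loss(expected, parsed)
loss.backward()
print(float(loss))
```

Because this signal is computed between two layout maps rather than between images, it needs no paired ground-truth photographs, which matches the framework's stated motivation.
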
Citations: 0
Multi-modal cooperative fusion network for dual-stream RGB-D salient object detection
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105835
Jingyu Wu , Fuming Sun , Haojie Li , Mingyu Lu
Most existing RGB-D salient object detection methods use convolution operations to design complex fusion modules for cross-modal information fusion. How to correctly integrate RGB and depth features into multi-modal features is important for salient object detection (SOD). Differences between the modalities seriously hinder salient object detection models from achieving better performance. To address the issues mentioned above, we design a multi-modal cooperative fusion network (MCFNet) for RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference information in shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to correct erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves the performance of the model. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve an efficient fusion of cross-modal features. Experimental results on six datasets show that our MCFNet performs better than other state-of-the-art (SOTA) methods, providing new ideas for salient object detection tasks.
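
For readers unfamiliar with layered cross-modal fusion, the sketch below shows one generic way to progressively merge RGB and depth feature maps level by level; the channel sizes and the concatenate-then-convolve design are assumptions made for illustration and do not reproduce MCFNet's modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveFusion(nn.Module):
    """Generic layered fusion: at each level, combine the RGB and depth features
    with the fused result propagated from the previous (coarser) level.
    Illustrative only; not the MCFNet architecture."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.fuse = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3 * c, c, kernel_size=3, padding=1),
                          nn.BatchNorm2d(c), nn.ReLU(inplace=True))
            for c in channels])
        self.up = nn.ModuleList([
            nn.Conv2d(channels[i + 1], channels[i], kernel_size=1)
            for i in range(len(channels) - 1)])

    def forward(self, rgb_feats, depth_feats):
        fused = None
        # Start from the deepest level and move toward the shallowest.
        for i in reversed(range(len(rgb_feats))):
            if fused is None:
                prev = rgb_feats[i]
            else:
                prev = self.up[i](F.interpolate(fused, size=rgb_feats[i].shape[-2:],
                                                mode="bilinear", align_corners=False))
            fused = self.fuse[i](torch.cat([rgb_feats[i], depth_feats[i], prev], dim=1))
        return fused

rgb = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 256), (64, 32, 16))]
dep = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 256), (64, 32, 16))]
print(ProgressiveFusion()(rgb, dep).shape)  # torch.Size([1, 64, 64, 64])
```
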
Citations: 0
CFE-PVTSeg: Cross-domain frequency-enhanced pyramid vision transformer segmentation network
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-17 | DOI: 10.1016/j.imavis.2025.105824
Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang
Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.
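
The MS-FUE module is described as weighting upsampled decoder features with an encoder-derived uncertainty map so that unreliable responses are suppressed. A minimal, hypothetical version of that weighting step is sketched below; treat the gating formula and function name as assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def uncertainty_gated_upsample(decoder_feat, uncertainty_map, target_size):
    """Upsample a decoder feature map and down-weight locations the encoder
    considers uncertain. `uncertainty_map` is assumed to lie in [0, 1], with
    1 meaning maximally uncertain. Illustrative sketch, not the MS-FUE module."""
    feat = F.interpolate(decoder_feat, size=target_size, mode="bilinear", align_corners=False)
    unc = F.interpolate(uncertainty_map, size=target_size, mode="bilinear", align_corners=False)
    confidence = 1.0 - unc                              # reliable regions keep their response
    global_ctx = feat.mean(dim=(2, 3), keepdim=True)    # uncertain regions fall back to global context
    return feat * confidence + global_ctx * unc

feat = torch.randn(1, 32, 22, 22)
unc = torch.rand(1, 1, 88, 88)   # e.g. entropy of an auxiliary prediction rescaled to [0, 1]
print(uncertainty_gated_upsample(feat, unc, (88, 88)).shape)  # torch.Size([1, 32, 88, 88])
```
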
Citations: 0
Better early detector for high-performance detection transformer
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-17 | DOI: 10.1016/j.imavis.2025.105829
Bin Hu , Bencheng Liao , Jiyang Qi , Shusheng Yang , Wenyu Liu
Transformers are revolutionizing the landscape of artificial intelligence, unifying the architecture for natural language processing, computer vision, and more. In this paper, we explore how far a Transformer-based architecture can go for object detection, a fundamental task in computer vision that is applicable across a range of engineering applications. We find that introducing an early detector can improve the performance of detection transformers, allowing them to know where to focus. To this end, we propose a novel attention-map-to-feature-map auxiliary loss and a novel local bipartite matching strategy to obtain, at no extra cost, a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone and achieves the best AP and latency among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates accuracy, speed, and parameter counts on par with the previous state-of-the-art CNN-based GFLV2 framework for the first time.
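
BETR's local bipartite matching builds on the standard Hungarian matching used by detection transformers to pair predictions with ground-truth boxes one-to-one. The sketch below shows only that standard global step with a toy cost of classification score plus L1 box distance; the "local" restriction proposed in the paper is not reproduced, and the cost weights are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """Standard DETR-style one-to-one assignment between predictions and ground truth.
    Cost = -P(GT class under the prediction) + box_weight * L1 box distance.
    Toy illustration; BETR's local matching strategy is not implemented here."""
    cls_cost = -pred_probs[:, gt_labels]                                   # (num_preds, num_gt)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = cls_cost + box_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

rng = np.random.default_rng(0)
pred_probs = rng.dirichlet(np.ones(4), size=6)      # 6 predictions over 4 classes
pred_boxes = rng.random((6, 4))                     # (cx, cy, w, h), normalized
gt_labels = np.array([1, 3])
gt_boxes = rng.random((2, 4))
print(hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes))
```
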
Citations: 0
MedSetFeat++: An attention-enriched set feature framework for few-shot medical image classification
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-15 | DOI: 10.1016/j.imavis.2025.105825
Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh
Few-shot learning (FSL) has emerged as a promising solution to address the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer. This limits their ability to capture detailed spatial, semantic, and contextual information, which is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture by incorporating several key innovations. It uses a multi-head attention mechanism with projections at multiple scales for the query, key, and value, allowing for more detailed feature interactions across different levels. It also includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to control the flow of attention in a dynamic way. Additionally, a Convolutional Block Attention Module (CBAM) based attention module is used to improve focus on the most relevant regions in the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted using three different medical imaging datasets: HAM10000, BreakHis at 400× magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for improving performance in few-shot medical image classification.
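
CBAM itself is a previously published module (channel attention followed by spatial attention); a compact PyTorch rendition of that standard design is given below for reference. How MedSetFeat++ wires it into the set-feature pipeline is not shown here, and the reduction ratio and spatial kernel size are the usual defaults, assumed for illustration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Standard Convolutional Block Attention Module: channel attention from
    pooled descriptors, then spatial attention from channel-wise statistics."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                       # channel attention branch
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        stats = torch.cat([x.mean(dim=1, keepdim=True),          # spatial attention branch
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(stats))

feat = torch.randn(2, 64, 28, 28)
print(CBAM(64)(feat).shape)  # torch.Size([2, 64, 28, 28])
```
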
Citations: 0
MOT-STM: Maritime Object Tracking: A Spatial-Temporal and Metadata-based approach
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-12 | DOI: 10.1016/j.imavis.2025.105826
Vinayak S. Nageli , Arshad Jamal , Puneet Goyal , Rama Krishna Sai S Gorthi
Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams present significant challenges, especially in search and rescue operations. These challenges mainly arise from the small apparent size of objects viewed from high drone altitudes, sudden movements of the drone's gimbal, and the limited appearance diversity of objects. Frequent occlusion under these challenging conditions makes Re-ID difficult in long-term tracking.
In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adapts multi-resolution spatial feature extraction using a Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via a Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.
We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages the drone's metadata, such as Global Positioning System (GPS) coordinates, altitude, and camera orientation, to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated into any State-of-the-Art (SotA) multi-object tracking framework as a Re-ID module.
Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves state-of-the-art performance, attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, showing the effectiveness and robustness of the proposed MOT-STM framework.
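
MARe-ID augments appearance cues with drone metadata (GPS, altitude, camera orientation). A hedged sketch of that general idea is shown below: each detection is mapped to an approximate world position from the metadata, and the re-identification cost mixes appearance distance with geographic distance. The flat-earth projection, the mixing weight, and all numeric values are simplifying assumptions, not the paper's formulation.

```python
import numpy as np

M_PER_DEG = 111_320.0  # rough meters per degree of latitude (flat-earth assumption)

def approx_object_position(lat, lon, altitude_m, yaw_deg, tilt_from_nadir_deg):
    """Crude estimate of where the camera's optical axis hits the water surface,
    from drone GPS, altitude and orientation. Illustrative only."""
    ground_range = altitude_m * np.tan(np.radians(tilt_from_nadir_deg))
    d_north = ground_range * np.cos(np.radians(yaw_deg))
    d_east = ground_range * np.sin(np.radians(yaw_deg))
    return lat + d_north / M_PER_DEG, lon + d_east / (M_PER_DEG * np.cos(np.radians(lat)))

def reid_cost(appearance_a, appearance_b, pos_a, pos_b, geo_scale_m=50.0, alpha=0.5):
    """Blend appearance (cosine) distance with geographic distance; alpha is an assumed weight."""
    app = 1.0 - np.dot(appearance_a, appearance_b) / (
        np.linalg.norm(appearance_a) * np.linalg.norm(appearance_b) + 1e-8)
    geo_m = M_PER_DEG * np.hypot(pos_a[0] - pos_b[0],
                                 (pos_a[1] - pos_b[1]) * np.cos(np.radians(pos_a[0])))
    return alpha * app + (1.0 - alpha) * min(geo_m / geo_scale_m, 1.0)

p1 = approx_object_position(21.1, 55.2, altitude_m=60.0, yaw_deg=30.0, tilt_from_nadir_deg=35.0)
p2 = approx_object_position(21.1, 55.2, altitude_m=60.0, yaw_deg=32.0, tilt_from_nadir_deg=36.0)
f1, f2 = np.random.rand(128), np.random.rand(128)
print(reid_cost(f1, f2, p1, p2))
```
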
Citations: 0
Unsupervised Object Localization driven by self-supervised foundation models: A comprehensive review
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-12 | DOI: 10.1016/j.imavis.2025.105807
Sotirios Papadopoulos , Emmanouil Patsiouras , Konstantinos Ioannidis , Stefanos Vrochidis , Ioannis Kompatsiaris , Ioannis Patras
Object localization is a fundamental task in computer vision that traditionally requires labeled datasets for accurate results. Recent progress in self-supervised learning has enabled unsupervised object localization, reducing reliance on manual annotations. Unlike supervised encoders, which depend on annotated training data, self-supervised encoders learn semantic representations directly from large collections of unlabeled images. This makes them the natural foundation for unsupervised object localization, as they capture object-relevant features while eliminating the need for costly manual labels. These encoders produce semantically coherent patch embeddings. Grouping these embeddings reveals sets of patches that correspond to objects in an image. These patch sets can be converted into object masks or bounding boxes, enabling tasks such as single-object discovery, multi-object detection, and instance segmentation. By applying off-line mask clustering or using pre-trained vision-language models, unsupervised localization methods can assign semantic labels to discovered objects. This transforms initially class-agnostic objects (objects without class labels) into class-aware ones (objects with class labels), aligning these tasks with their supervised counterparts. This paper provides a structured review of unsupervised object localization methods in both class-agnostic and class-aware settings. In contrast, previous surveys have focused only on class-agnostic localization. We discuss state-of-the-art object discovery strategies based on self-supervised features and provide a detailed comparison of experimental results across a wide range of tasks, datasets, and evaluation metrics.
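
The review's central mechanism is that self-supervised patch embeddings can be grouped into object regions without labels. As a generic, hedged illustration (not tied to any specific surveyed method), the sketch below clusters a grid of patch embeddings with k-means and keeps the cluster least present on the image border as a crude foreground mask; the two-cluster setting and the border heuristic are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def patch_embeddings_to_mask(patch_feats, grid_hw, n_clusters=2, seed=0):
    """Group (H*W, D) self-supervised patch embeddings into clusters and pick as
    foreground the cluster least represented on the image border.
    Generic illustration of unsupervised localization, not a specific published method."""
    h, w = grid_hw
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(patch_feats)
    labels = labels.reshape(h, w)
    border = np.concatenate([labels[0], labels[-1], labels[:, 0], labels[:, -1]])
    border_share = [(border == k).mean() for k in range(n_clusters)]
    foreground = int(np.argmin(border_share))   # objects rarely dominate the image border
    return labels == foreground                 # boolean (H, W) mask

# Toy input standing in for ViT patch tokens (e.g. a 14x14 grid of 384-d DINO-style features).
feats = np.random.rand(14 * 14, 384)
mask = patch_embeddings_to_mask(feats, (14, 14))
print(mask.shape, mask.sum())
```

The resulting class-agnostic mask is what off-line clustering or a vision-language model would then label to make the output class-aware, as described above.
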
Citations: 0
CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-10 | DOI: 10.1016/j.imavis.2025.105823
Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen
Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a hybrid CNN-Transformer network for facial attribute recognition, termed Class Imbalance Transformer-CNN (CI-TransCNN), which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) mechanism to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF fuses the features extracted from TransCNN, using spatial and channel attention to further improve feature discriminability. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.
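
CIBCE is the paper's own loss and its exact form is not given in this listing. For orientation, the sketch below shows one common way to counter class imbalance in multi-label facial-attribute training: a per-attribute positively weighted BCE with weights derived from training-set positive rates. Treat it as a generic baseline and the function name as hypothetical, not as CIBCE.

```python
import torch
import torch.nn.functional as F

def imbalance_weighted_bce(logits, targets, positive_rates):
    """Multi-label BCE with per-attribute pos_weight = (1 - p) / p, where p is the
    attribute's positive rate in the training set, so rare attributes contribute
    more per positive example. Generic recipe, not the paper's CIBCE."""
    pos_weight = (1.0 - positive_rates) / positive_rates.clamp(min=1e-6)
    return F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)

# Toy batch: 8 faces, 40 binary attributes (CelebA-style), with assumed positive rates.
logits = torch.randn(8, 40)
targets = torch.randint(0, 2, (8, 40)).float()
positive_rates = torch.rand(40) * 0.5 + 0.02
print(float(imbalance_weighted_bce(logits, targets, positive_rates)))
```
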
Citations: 0
Density-aware global–local attention network for point cloud segmentation
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-08 | DOI: 10.1016/j.imavis.2025.105822
Chade Li , Pengju Zhang , Jiaming Zhang , Yihong Wu
3D point cloud segmentation has a wide range of applications in areas such as autonomous driving, augmented reality, virtual reality, and digital twins. Point cloud data collected in real scenes often contain small objects and categories with small sample sizes, which are difficult for existing networks to handle. In this regard, we propose a point cloud segmentation network that fuses density-aware local attention with global attention. The core idea is to increase the effective receptive field of each point while reducing the loss of information about small objects in dense areas. Specifically, we assign windows of different sizes to local areas of different densities and compute attention within each window. Furthermore, we treat each local area as an independent token for global attention over the entire input. A category-response loss is also proposed to balance the processing of objects of different categories and sizes. In particular, we add a fully connected layer in the middle of the network to predict the presence of object categories and construct a binary cross-entropy loss over these category-presence predictions. In experiments, our method achieves competitive results in semantic segmentation and part segmentation tasks on several publicly available datasets. Experiments on point cloud data obtained from complex real-world scenes filled with tiny objects also validate the strong segmentation capability of our method for small objects as well as small-sample categories.
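
The category-response idea, an extra fully connected layer in the middle of the network that predicts which categories are present in the scene and is trained with a binary cross-entropy term, can be sketched very simply. The pooling choice, layer placement, and class names below are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CategoryPresenceHead(nn.Module):
    """Auxiliary head: pool per-point features to a scene descriptor and predict a
    multi-hot vector of category presence, supervised with BCE. Illustrative sketch
    of the described category-response loss, not the paper's exact layer."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, point_feats, present_labels):
        # point_feats: (B, N, C) intermediate features; present_labels: (B, num_classes) multi-hot.
        scene_desc = point_feats.max(dim=1).values   # max-pool over points (assumed pooling)
        logits = self.fc(scene_desc)
        return self.bce(logits, present_labels)

feats = torch.randn(2, 4096, 128)                    # 2 scenes, 4096 points, 128-d features
present = torch.zeros(2, 13)
present[0, [0, 3, 5]] = 1.0
present[1, [2, 5]] = 1.0
print(float(CategoryPresenceHead(128, 13)(feats, present)))
```

In training, such a term would be added to the point-wise segmentation loss so that under-represented categories still produce a scene-level gradient signal.
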
Citations: 0