
Image and Vision Computing: Latest Publications

CF-SOLT: Real-time and accurate traffic accident detection using correlation filter-based tracking
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-14 | DOI: 10.1016/j.imavis.2024.105336
Yingjie Xia, Nan Qian, Lin Guo, Zheming Cai
Traffic accident detection from video surveillance is a valuable research topic in intelligent transportation systems: responding to accidents promptly can help avoid traffic jams and prevent secondary accidents. In traffic accident detection, tracking occluded vehicles accurately and in real time is one of the major sticking points for practical applications. To improve the tracking of occluded vehicles for traffic accident detection, this paper proposes a simple online tracking scheme with correlation filters (CF-SOLT). The CF-SOLT method uses a correlation filter-based auxiliary tracker to assist the main tracker; the auxiliary tracker helps prevent the target ID switches caused by occlusion, enabling accurate vehicle tracking in occluded scenes. Based on the tracking results, a precise traffic accident detection algorithm is developed by integrating behavior analysis of both vehicles and pedestrians. The improved accident detection algorithm with the correlation filter-based auxiliary tracker provides a shorter response time, enabling quick identification and detection of traffic accidents. Experiments are conducted on VisDrone2019, MOT-Traffic, and an accident dataset to evaluate MOTA, IDF1, FPS, precision, response time, and other metrics. The results show that CF-SOLT improves MOTA and IDF1 by 5.3% and 6.7%, respectively, raises accident detection precision by 25%, and reduces response time by 56 s.
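The paper's CF-SOLT tracker itself is not reproduced here, but the core ingredient it builds on, a correlation filter that re-localizes a target from the peak of a correlation response, can be sketched in a few lines. The following is a minimal, hypothetical single-channel MOSSE-style filter in NumPy (class and parameter names are ours); a production tracker would add cosine windowing, multi-channel features, and the occlusion-aware ID logic described above.

```python
import numpy as np

def gaussian_response(h, w, sigma=2.0):
    """Desired correlation output: a Gaussian peak at the patch centre."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

class MosseFilter:
    """Minimal single-channel MOSSE-style correlation filter (no cosine window)."""
    def __init__(self, patch, lam=1e-3, lr=0.125):
        self.lam, self.lr = lam, lr
        G = np.fft.fft2(gaussian_response(*patch.shape))
        F = np.fft.fft2(patch)
        self.A = G * np.conj(F)           # numerator of the filter
        self.B = F * np.conj(F) + lam     # regularised denominator

    def respond(self, patch):
        """Correlation response map; the peak offset gives the target displacement."""
        H_conj = self.A / self.B
        return np.real(np.fft.ifft2(np.fft.fft2(patch) * H_conj))

    def update(self, patch):
        """Running-average update so the template adapts to appearance changes."""
        G = np.fft.fft2(gaussian_response(*patch.shape))
        F = np.fft.fft2(patch)
        self.A = (1 - self.lr) * self.A + self.lr * G * np.conj(F)
        self.B = (1 - self.lr) * self.B + self.lr * (F * np.conj(F) + self.lam)

if __name__ == "__main__":
    patch = np.random.rand(64, 64)
    cf = MosseFilter(patch)
    resp = cf.respond(patch)
    print("peak at", np.unravel_index(np.argmax(resp), resp.shape))  # ~ (32, 32)
```

In a full tracker, the auxiliary filter of this kind would keep responding on the occluded target's last template, so the main tracker can re-associate the same ID once the vehicle reappears.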
Citations: 0
TransWild: Enhancing 3D interacting hands recovery in the wild with IoU-guided Transformer
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-12 | DOI: 10.1016/j.imavis.2024.105316
Wanru Zhu, Yichen Zhang, Ke Chen, Lihua Guo
The recovery of 3D interacting hand meshes in the wild (ITW) is crucial for 3D full-body mesh reconstruction, especially when limited 3D annotations are available. A recent ITW interacting-hands recovery method brings the two hands into a shared 2D scale space and achieves effective learning on ITW datasets, but it does not deeply exploit the intrinsic interaction dynamics of the hands. In this work, we propose TransWild, a novel framework for 3D interacting hand mesh recovery that leverages a weight-shared Intersection-over-Union (IoU) guided Transformer for feature interaction. By harmonizing ITW and MoCap datasets within a unified 2D scale space, our hand feature interaction mechanism powered by an IoU-guided Transformer enables more accurate estimation of interacting hands. This innovation stems from the observation that hand detection yields a valuable IoU between the two hands' bounding boxes; an IoU-guided Transformer can therefore significantly enrich the Transformer's ability to decode and integrate this cue into the interacting-hand recovery process. To ensure consistent training outcomes, we develop a strategy of training with augmented ground-truth bounding boxes to address their inherent variability. Quantitative evaluations on two prominent benchmarks for 3D interacting hands underscore our method's superior performance. The code will be released after acceptance.
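As an illustration of how a detection-derived IoU could steer attention between the two hands, here is a small, hypothetical PyTorch sketch; the log-bias scheme, function names, and shapes are our assumptions, not the paper's architecture.

```python
import torch

def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) tensors."""
    x1 = torch.max(box_a[0], box_b[0]); y1 = torch.max(box_a[1], box_b[1])
    x2 = torch.min(box_a[2], box_b[2]); y2 = torch.min(box_a[3], box_b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def iou_guided_cross_attention(q, k, v, iou):
    """Scaled dot-product attention from one hand's tokens to the other's,
    with the cross-hand logits biased by the (log) box overlap."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    logits = logits + torch.log(iou.clamp(min=1e-3))  # more overlap -> stronger interaction
    return torch.softmax(logits, dim=-1) @ v

left_box = torch.tensor([10.0, 20.0, 110.0, 160.0])
right_box = torch.tensor([80.0, 30.0, 190.0, 170.0])
iou = box_iou(left_box, right_box)

q = torch.randn(49, 256)  # queries from the left-hand tokens
k = torch.randn(49, 256)  # keys from the right-hand tokens
v = torch.randn(49, 256)  # values from the right-hand tokens
out = iou_guided_cross_attention(q, k, v, iou)
print(f"IoU = {iou.item():.3f}, attended features: {tuple(out.shape)}")
```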
Citations: 0
Machine learning applications in breast cancer prediction using mammography
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-10 | DOI: 10.1016/j.imavis.2024.105338
G.M. Harshvardhan, Kei Mori, Sarika Verma, Lambros Athanasiou
Breast cancer is the second leading cause of cancer-related deaths among women. Early detection of lumps and subsequent risk assessment significantly improve prognosis. In screening mammography, radiologists' interpretation of mammograms is prone to high error rates and requires extensive manual effort. To this end, several computer-aided diagnosis methods using machine learning have been proposed for automatic detection of breast cancer in mammography. In this paper, we provide a comprehensive review and analysis of these methods and discuss practical issues associated with their reproducibility. We aim to help readers choose an appropriate method to implement and guide them towards this purpose. Moreover, we re-implement a sample of the presented methods to highlight the importance of providing the technical details associated with them. Advancing breast cancer pathology classification with machine learning depends on the availability of public databases and the development of innovative methods. Although there is significant progress in both areas, more transparency in the latter would boost the domain's progress.
Citations: 0
Channel and Spatial Enhancement Network for human parsing
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-08 | DOI: 10.1016/j.imavis.2024.105332
Kunliang Liu, Rize Jin, Yuelong Li, Jianming Wang, Wonjun Hwang
The dominant backbones of neural networks for scene parsing consist of multiple stages, where the feature maps of different stages contain varying levels of spatial and semantic information. High-level features convey more semantics and fewer spatial details, while low-level features carry fewer semantics and more spatial details. Consequently, there are semantic-spatial gaps among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation, without addressing the semantic-spatial gaps among these features. This inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing, especially for human parsing, which demands more semantic information and finer feature-map detail owing to intricate textures, diverse clothing styles, and heavy scale variability across different human parts. In this paper, we alleviate the long-standing challenge of semantic-spatial gaps between features from different stages by innovatively using subtraction and addition operations to recognize the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, offering a straightforward and intuitive solution: high-level semantic information is injected into lower-stage features and, conversely, fine details are introduced into higher-stage features. Extensive experiments on three dense prediction tasks demonstrate the efficacy of our method. Specifically, it achieves the best performance on the LIP and CIHP datasets, and we also verify its generality on the ADE20K dataset.
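A minimal sketch of the subtraction/addition idea, assuming one low-stage and one high-stage feature map and our own module names (this is not the paper's exact CSENet block): the high-stage map is projected and upsampled, the difference exposes what the two stages disagree on, and the sum carries what they share.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossStageEnhance(nn.Module):
    """Illustrative cross-stage block: subtraction exposes semantic/spatial
    differences between stages, addition keeps their shared content."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, out_ch, 1)
        self.high_proj = nn.Conv2d(high_ch, out_ch, 1)
        self.diff_conv = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 3, padding=1)

    def forward(self, low, high):
        high = F.interpolate(self.high_proj(high), size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        low = self.low_proj(low)
        diff = self.diff_conv(high - low)   # what one stage has and the other lacks
        shared = high + low                 # information both stages agree on
        return self.fuse(torch.cat([shared, diff], dim=1))

if __name__ == "__main__":
    low = torch.randn(1, 64, 56, 56)    # early stage: fine spatial detail
    high = torch.randn(1, 512, 14, 14)  # late stage: strong semantics
    out = CrossStageEnhance(64, 512, 128)(low, high)
    print(out.shape)  # torch.Size([1, 128, 56, 56])
```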
Citations: 0
Non-negative subspace feature representation for few-shot learning in medical imaging
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-07 | DOI: 10.1016/j.imavis.2024.105334
Keqiang Fan, Xiaohao Cai, Mahesan Niranjan
Unlike typical visual scene recognition tasks, where massive datasets are available to train deep neural networks (DNNs), medical image diagnosis with DNNs often faces challenges due to data scarcity. In this paper, we investigate the effectiveness of data-based few-shot learning in medical imaging by exploring different data attribute representations in a low-dimensional space. We introduce different types of non-negative matrix factorization (NMF) into few-shot learning to investigate the information preserved in the subspace resulting from dimensionality reduction, which is crucial for mitigating the data scarcity problem in medical image classification. Extensive empirical studies validate the effectiveness of NMF, especially its supervised variants (e.g., discriminative NMF, and supervised and constrained NMF with sparseness), and compare it with principal component analysis (PCA), i.e., the collaborative-representation-based dimensionality reduction technique derived from eigenvectors. On 14 datasets covering 11 distinct illness categories, thorough experimental results and comparisons with related techniques demonstrate that NMF is a competitive alternative to PCA for few-shot learning in medical imaging, and that the supervised NMF algorithms are more discriminative and more effective in the subspace. Furthermore, we show that the part-based representation of NMF, especially its supervised variants, has a dramatic impact on detecting lesion areas in medical imaging with limited samples.
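The NMF-versus-PCA comparison can be sketched with scikit-learn on synthetic non-negative features; the data, episode sizes, and nearest-centroid classifier below are placeholders, not the paper's medical datasets or its supervised NMF variants. Fit the subspace on the few support samples, project the queries, and classify.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA
from sklearn.neighbors import NearestCentroid

# Synthetic stand-in for non-negative deep features (e.g. ReLU activations)
# of a 3-way few-shot episode; the paper uses real medical-image features.
rng = np.random.default_rng(0)
n_way, k_shot, n_query, dim = 3, 5, 15, 256
prototypes = rng.random((n_way, dim)) * 2.0

def sample(n):
    """Features clustered around each class prototype, kept non-negative."""
    X = np.vstack([np.abs(p + 0.3 * rng.standard_normal((n, dim))) for p in prototypes])
    y = np.repeat(np.arange(n_way), n)
    return X, y

X_support, y_support = sample(k_shot)
X_query, y_query = sample(n_query)

for name, reducer in [("NMF", NMF(n_components=10, init="nndsvda", max_iter=500)),
                      ("PCA", PCA(n_components=10))]:
    Z_support = reducer.fit_transform(X_support)   # learn subspace on the support set
    Z_query = reducer.transform(X_query)           # project queries into it
    clf = NearestCentroid().fit(Z_support, y_support)
    acc = (clf.predict(Z_query) == y_query).mean()
    print(f"{name}: query accuracy = {acc:.2f}")
```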
Citations: 0
RGB-T tracking with frequency hybrid awareness
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-06 | DOI: 10.1016/j.imavis.2024.105330
Lei Lei, Xianxian Li
Recently, impressive progress has been made with transformer-based RGB-T trackers owing to the transformer's effectiveness in capturing low-frequency information (i.e., high-level semantic information). However, some studies have revealed that the transformer is limited in capturing high-frequency information (i.e., low-level texture and edge details), which restricts the tracker's ability to precisely match target details within the search area. To address this issue, we propose a frequency hybrid awareness modeling RGB-T tracker, abbreviated as FHAT. Specifically, FHAT combines the advantages of convolution and max pooling in capturing high-frequency information on top of the transformer architecture, strengthening high-frequency features and enhancing the model's perception of detailed information. Additionally, to enhance the complementary effect between the two modalities, the tracker uses low-frequency information from both modalities for modality interaction, which avoids interaction errors caused by inconsistent local details across modalities. The high-frequency features and the interacted low-frequency features are then fused, allowing the model to adaptively enhance the frequency characteristics of the modal representation. Extensive experiments on two mainstream RGB-T tracking benchmarks show that our method achieves competitive performance.
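One plausible way to realize "convolution plus max pooling for high-frequency cues" inside a transformer tracker is sketched below in PyTorch; the module name, depthwise convolution, and fusion by addition are our assumptions rather than FHAT's exact design.

```python
import torch
import torch.nn as nn

class HighFreqEnhancer(nn.Module):
    """Sketch only: convolution and max pooling on the spatial token map
    re-inject high-frequency texture/edge cues that attention tends to smooth out."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local texture
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)           # sharp local maxima
        self.proj = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, tokens, h, w):
        # tokens: (B, N, C) transformer tokens of an h x w search region
        B, N, C = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, C, h, w)
        high = self.proj(torch.cat([self.conv(x), self.pool(x)], dim=1))
        x = x + high                            # enhance details, keep low-freq content
        return x.flatten(2).transpose(1, 2)     # back to (B, N, C)

if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 16, 384)       # e.g. fused RGB-T search tokens
    out = HighFreqEnhancer(384)(tokens, 16, 16)
    print(out.shape)  # torch.Size([2, 256, 384])
```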
Citations: 0
Text-augmented Multi-Modality contrastive learning for unsupervised visible-infrared person re-identification
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-05 | DOI: 10.1016/j.imavis.2024.105310
Rui Sun, Guoxi Huang, Xuebin Wang, Yun Du, Xudong Zhang
Visible-infrared person re-identification holds significant implications for intelligent security. Unsupervised methods can reduce the gap between the two modalities without requiring labels. Most previous unsupervised methods train their models with image information only, so the model cannot obtain powerful deep semantic information. In this paper, we leverage CLIP to extract deep text information. We propose a Text-Image Alignment (TIA) module to align image and text information and effectively bridge the gap between the visible and infrared modalities. We further introduce a Local-Global Image Match (LGIM) module to find homogeneous information. Specifically, we employ the Hungarian algorithm and a Simulated Annealing (SA) algorithm to obtain original information from image features while mitigating the interference of heterogeneous information. Additionally, we design a Changeable Cross-modality Alignment Loss (CCAL) that lets the model learn modality-specific features at different training stages. Our method performs well and attains strong robustness through targeted learning. Extensive experiments demonstrate the effectiveness of our approach; it achieves a rank-1 accuracy that exceeds state-of-the-art approaches by approximately 10% on RegDB.
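The Hungarian-algorithm step maps directly onto scipy.optimize.linear_sum_assignment. Below is a toy sketch, with random centroids standing in for clustered visible and infrared person features; the simulated-annealing refinement mentioned above is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy stand-in for cluster centroids of the two modalities; in practice these
# would come from unsupervised clustering of visible and infrared features.
rng = np.random.default_rng(1)
visible_centroids = rng.standard_normal((6, 128))
infrared_centroids = visible_centroids[rng.permutation(6)] + 0.1 * rng.standard_normal((6, 128))

def cosine_cost(a, b):
    """Cost matrix: low cost means the two cluster centroids are similar."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

cost = cosine_cost(visible_centroids, infrared_centroids)
rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm: one-to-one matching
for v, i in zip(rows, cols):
    print(f"visible cluster {v} <-> infrared cluster {i} (cost {cost[v, i]:.3f})")
```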
Citations: 0
Fine-grained semantic oriented embedding set alignment for text-based person search
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-05 | DOI: 10.1016/j.imavis.2024.105309
Jiaqi Zhao, Ao Fu, Yong Zhou, Wen-liang Du, Rui Yao
Text-based person search aims to retrieve images of a person that are highly semantically relevant to a given textual description. The difficulties of this retrieval task are modality heterogeneity and fine-grained matching. Most existing methods only consider alignment of global features, ignoring the fine-grained matching problem. Cross-modal attention interactions over image patches and text tokens are popularly used for direct alignment; however, cross-modal attention can incur a huge overhead at inference time and cannot be applied in practical scenarios. In addition, patch-token alignment is unreasonable, since individual image patches and text tokens do not carry complete semantic information. This paper proposes an Embedding Set Alignment (ESA) module for fine-grained alignment. The module preserves fine-grained semantic information by merging token-level features into embedding sets. The ESA module benefits from pre-trained cross-modal large models, can be combined with the backbone non-intrusively, and is trained in an end-to-end manner. In addition, an Adaptive Semantic Margin (ASM) loss is designed to describe the alignment of embedding sets, instead of adopting a loss function with a fixed margin. Extensive experiments demonstrate that our fine-grained semantic embedding set alignment method achieves state-of-the-art performance on three popular benchmark datasets, surpassing the previous best methods.
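A rough sketch of the embedding-set idea, under our own simplifications: chunk-wise pooling stands in for the paper's ESA merging, a max-then-mean score stands in for set-to-set similarity, and a fixed-margin ranking loss replaces the adaptive ASM loss.

```python
import torch
import torch.nn.functional as F

def tokens_to_embedding_set(tokens, set_size=4):
    """Merge token-level features into a small embedding set by average-pooling
    contiguous chunks (a simplification of the ESA merging rule)."""
    B, N, C = tokens.shape
    chunks = tokens.reshape(B, set_size, N // set_size, C)
    return F.normalize(chunks.mean(dim=2), dim=-1)        # (B, set_size, C)

def set_similarity(img_set, txt_set):
    """For each text embedding take its best-matching image embedding, then average."""
    sim = img_set @ txt_set.transpose(-2, -1)             # (B, set_img, set_txt)
    return sim.max(dim=1).values.mean(dim=-1)             # (B,)

def margin_ranking_loss(pos_score, neg_score, margin=0.2):
    """Fixed-margin ranking loss; the paper's ASM loss adapts the margin instead."""
    return F.relu(margin - pos_score + neg_score).mean()

if __name__ == "__main__":
    img_tokens = torch.randn(8, 196, 512)     # e.g. ViT patch tokens
    txt_tokens = torch.randn(8, 64, 512)      # e.g. text tokens
    img_set = tokens_to_embedding_set(img_tokens)
    txt_set = tokens_to_embedding_set(txt_tokens)
    pos = set_similarity(img_set, txt_set)                    # matched pairs
    neg = set_similarity(img_set, txt_set.roll(1, dims=0))    # mismatched pairs
    print("loss:", margin_ranking_loss(pos, neg).item())
```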
Citations: 0
SAFENet: Semantic-Aware Feature Enhancement Network for unsupervised cross-domain road scene segmentation
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-04 | DOI: 10.1016/j.imavis.2024.105318
Dexin Ren, Minxian Li, Shidong Wang, Mingwu Ren, Haofeng Zhang
Unsupervised cross-domain road scene segmentation has attracted substantial interest because of its capability to perform segmentation on new, unlabeled domains, thereby reducing the dependence on expensive manual annotations. This is achieved by leveraging networks trained on labeled source domains to classify images in unlabeled target domains. Conventional techniques usually use adversarial networks to align source and target inputs in one of the two domains; however, these approaches often fail to effectively integrate information from both domains, because alignment in either space alone tends to bias feature learning. To overcome these limitations and enhance cross-domain interaction while mitigating overfitting to the source domain, we introduce a novel framework called the Semantic-Aware Feature Enhancement Network (SAFENet) for unsupervised cross-domain road scene segmentation. SAFENet incorporates a Semantic-Aware Enhancement (SAE) module to amplify the importance of class information in segmentation tasks and uses the semantic space as a new domain to guide the alignment of the source and target domains. Additionally, we integrate Adaptive Instance Normalization with Momentum (AdaIN-M), which converts the source-domain image style to the target-domain image style, thereby reducing the adverse effect of source-domain overfitting on target-domain segmentation performance. Moreover, SAFENet employs a Knowledge Transfer (KT) module to optimize the network architecture, enhancing computational efficiency during testing while maintaining the robust inference capabilities developed during training. To further improve segmentation performance, we employ Curriculum Learning, a self-training mechanism that uses pseudo-labels derived from the target domain to iteratively refine the network. Comprehensive experiments on three well-known datasets, forming the Synthia→Cityscapes and GTA5→Cityscapes benchmarks, demonstrate the superior performance of our method. In-depth examinations and ablation studies verify the efficacy of each module of the proposed method.
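Standard AdaIN re-normalizes source features to target-domain statistics; a momentum variant can keep a running estimate of those statistics. The sketch below is our simplified reading of that idea; the class name, buffer layout, and momentum value are assumptions, not the paper's AdaIN-M module.

```python
import torch

class MomentumAdaIN:
    """Sketch of AdaIN with a momentum buffer: source features are re-normalised
    to running target-domain statistics, so the source batch 'looks like' the
    target domain in feature space."""
    def __init__(self, channels, momentum=0.9):
        self.momentum = momentum
        self.t_mean = torch.zeros(1, channels, 1, 1)
        self.t_std = torch.ones(1, channels, 1, 1)

    @staticmethod
    def _stats(x):
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-5
        return mean, std

    def update_target(self, target_feat):
        """Exponential moving average of target-domain channel statistics."""
        mean, std = self._stats(target_feat)
        self.t_mean = self.momentum * self.t_mean + (1 - self.momentum) * mean.mean(0, keepdim=True)
        self.t_std = self.momentum * self.t_std + (1 - self.momentum) * std.mean(0, keepdim=True)

    def __call__(self, source_feat):
        """AdaIN: strip source style (per-channel mean/std), apply target style."""
        mean, std = self._stats(source_feat)
        return self.t_std * (source_feat - mean) / std + self.t_mean

if __name__ == "__main__":
    adain = MomentumAdaIN(channels=256)
    adain.update_target(torch.randn(4, 256, 32, 64) * 2 + 1)   # target-domain batch
    stylised = adain(torch.randn(4, 256, 32, 64))               # source-domain batch
    print(stylised.shape, round(stylised.mean().item(), 3))
```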
Citations: 0
Attention enhanced machine instinctive vision with human-inspired saliency detection
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-11-04 | DOI: 10.1016/j.imavis.2024.105308
Habib Khan, Muhammad Talha Usman, Imad Rida, JaKeoung Koo
Salient object detection (SOD) enables machines to recognize and accurately segment visually prominent regions in images. Despite recent advancements, existing approaches often lack progressive fusion of low- and high-level features, effective multi-scale feature handling, and precise boundary detection. Moreover, the robustness of these models under varied lighting conditions remains a concern. To overcome these challenges, we present an Attention Enhanced Machine Instinctive Vision framework for SOD. The proposed framework leverages Multi-stage Feature Refinement with an Optimal Attentions-Driven Framework (MFRNet). Multi-level features are extracted from six stages of an EfficientNet-B7 backbone, providing effective fusion of low- and high-level details across various scales in the later stages of the framework. We introduce the Spatial-optimized Feature Attention (SOFA) module, which refines spatial features from three initial-stage feature maps: the extracted multi-scale backbone features are passed through convolutional feature transformation and spatial attention mechanisms to refine the low-level information, and the SOFA module concatenates and upsamples these refined features, producing a comprehensive spatial representation across levels. Moreover, the proposed Context-Aware Channel Refinement (CACR) module integrates dilated convolutions with optimized dilation rates, followed by channel attention, to capture multi-scale contextual information from the three mature layers. Furthermore, our progressive feature fusion strategy combines high-level semantic information and low-level spatial details through multiple residual connections, ensuring robust feature representation and effective gradient backpropagation. To enhance robustness, we train the network with augmented data featuring low- and high-brightness adjustments, improving its ability to handle diverse lighting conditions. Extensive experiments on four benchmark datasets (ECSSD, HKU-IS, DUTS, and PASCAL-S) validate the framework's effectiveness, demonstrating superior performance compared with existing SOTA methods in the domain. Code, qualitative results, and trained weights will be available at: https://github.com/habib1402/MFRNet-SOD.
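The described combination of dilated convolutions and channel attention can be sketched as follows; the exact dilation rates, reduction ratio, and module layout are our assumptions, not the released MFRNet code.

```python
import torch
import torch.nn as nn

class ContextChannelRefine(nn.Module):
    """Sketch in the spirit of the described CACR module: parallel dilated
    convolutions gather multi-scale context, then a squeeze-and-excitation
    style channel attention re-weights the result."""
    def __init__(self, ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in dilations)
        self.merge = nn.Conv2d(len(dilations) * ch, ch, 1)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # squeeze: global context per channel
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())      # excite: channel weights

    def forward(self, x):
        ctx = self.merge(torch.cat([b(x) for b in self.branches], dim=1))
        return ctx * self.attn(ctx)                       # channel-attended contextual features

if __name__ == "__main__":
    feat = torch.randn(2, 128, 20, 20)                    # a deep backbone feature map
    print(ContextChannelRefine(128)(feat).shape)          # torch.Size([2, 128, 20, 20])
```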
Citations: 0