Breast cancer is the second leading cause of cancer-related deaths among women. Early detection of lumps and subsequent risk assessment significantly improve prognosis. In screening mammography, radiologist interpretation of mammograms is prone to high error rates and requires extensive manual effort. To this end, several computer-aided diagnosis methods using machine learning have been proposed for automatic detection of breast cancer in mammography. In this paper, we provide a comprehensive review and analysis of these methods and discuss practical issues associated with their reproducibility. We aim to help readers choose an appropriate method to implement and to guide them through that choice. Moreover, we re-implement a sample of the presented methods to highlight the importance of reporting the technical details of those methods. Advancing breast cancer pathology classification using machine learning depends on the availability of public databases and the development of innovative methods. Although there has been significant progress in both areas, more transparency in the latter would accelerate progress in the domain.
Citation: G.M. Harshvardhan, Kei Mori, Sarika Verma, Lambros Athanasiou. "Machine learning applications in breast cancer prediction using mammography." Image and Vision Computing, vol. 152, Article 105338, published 2024-11-10. DOI: 10.1016/j.imavis.2024.105338
Pub Date: 2024-11-08
DOI: 10.1016/j.imavis.2024.105332
Kunliang Liu, Rize Jin, Yuelong Li, Jianming Wang, Wonjun Hwang
The dominant backbones of neural networks for scene parsing consist of multiple stages, where feature maps in different stages often contain varying levels of spatial and semantic information. High-level features convey more semantics and fewer spatial details, while low-level features possess fewer semantics and more spatial details. Consequently, there are semantic-spatial gaps among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation, without addressing the semantic-spatial gaps present among these features. This inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing, especially in human parsing, which demands more semantic information and finer feature-map detail because of intricate textures, diverse clothing styles, and heavy scale variability across different human parts. In this paper, we alleviate the long-standing semantic-spatial gaps between features from different stages by using subtraction and addition operations to identify the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, offering a straightforward and intuitive solution that injects high-level semantic information into lower-stage features and, conversely, introduces fine details into higher-stage features. Extensive experiments on three dense prediction tasks have demonstrated the efficacy of our method. Specifically, our method achieves the best performance on the LIP and CIHP datasets and we also verify the generality of our method on the ADE20K dataset.
Citation: Kunliang Liu, Rize Jin, Yuelong Li, Jianming Wang, Wonjun Hwang. "Channel and Spatial Enhancement Network for human parsing." Image and Vision Computing, vol. 152, Article 105332, published 2024-11-08. DOI: 10.1016/j.imavis.2024.105332
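The baseline aggregation described in the abstract above (upsample a coarse high-level map, then add it to a finer low-level map) can be sketched on toy data. This is only an illustration under stated assumptions: the feature values, the nearest-neighbour upsampling, and the helper names `upsample_nearest` and `elementwise` are hypothetical, and the `gap` map merely shows what a subtraction exposes. It is not CSENet's actual enhancement module, which the abstract does not specify.

```python
def upsample_nearest(feature, factor):
    """Nearest-neighbour upsampling of a 2-D feature map given as a list of rows."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(factor)]   # repeat each column
        out.extend([list(wide) for _ in range(factor)])  # repeat each row
    return out

def elementwise(a, b, op):
    """Apply a binary operation position by position to two equally sized maps."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical single-channel features: a coarse high-level map and a fine low-level map.
high = [[1.0, 3.0]]
low = [[1.0, 0.0, 3.0, 5.0],
       [1.0, 2.0, 3.0, 3.0]]

high_up = upsample_nearest(high, 2)                        # bring both maps to 2x4
fused_add = elementwise(high_up, low, lambda x, y: x + y)  # baseline: plain addition
gap = elementwise(high_up, low, lambda x, y: x - y)        # where the two levels disagree

print(fused_add)
print(gap)
```

Non-zero entries in `gap` mark positions where the upsampled high-level map and the low-level map differ, which is the kind of signal a compensation step could act on.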
Pub Date: 2024-11-07
DOI: 10.1016/j.imavis.2024.105334
Keqiang Fan, Xiaohao Cai, Mahesan Niranjan
Unlike typical visual scene recognition tasks, where massive datasets are available to train deep neural networks (DNNs), medical image diagnosis using DNNs often faces challenges due to data scarcity. In this paper, we investigate the effectiveness of data-based few-shot learning in medical imaging by exploring different data attribute representations in a low-dimensional space. We introduce different types of non-negative matrix factorization (NMF) in few-shot learning to investigate the information preserved in the subspace resulting from dimensionality reduction, which is crucial to mitigate the data scarcity problem in medical image classification. Extensive empirical studies validate the effectiveness of NMF, especially its supervised variants (e.g., discriminative NMF, and supervised and constrained NMF with sparseness), and compare it with principal component analysis (PCA), i.e., the collaborative representation-based dimensionality reduction technique derived from eigenvectors. Across 14 datasets covering 11 distinct illness categories, experimental results and comparisons with related techniques demonstrate that NMF is a competitive alternative to PCA for few-shot learning in medical imaging, and that the supervised NMF algorithms are more discriminative in the subspace and more effective. Furthermore, we show that the part-based representation of NMF, especially its supervised variants, is highly effective in detecting lesion areas in medical imaging with limited samples.
Citation: Keqiang Fan, Xiaohao Cai, Mahesan Niranjan. "Non-negative subspace feature representation for few-shot learning in medical imaging." Image and Vision Computing, vol. 152, Article 105334, published 2024-11-07. DOI: 10.1016/j.imavis.2024.105334
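To make the NMF subspace in the abstract above concrete, here is a minimal sketch of plain, unsupervised NMF using the classic Lee-Seung multiplicative updates for the Frobenius objective. It uses only the standard library to stay dependency-free. The data matrix, rank, iteration count, and function names are assumptions chosen for illustration, and the supervised variants the abstract mentions are not shown.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5

def nmf(v, rank, iterations=200, eps=1e-9, seed=0):
    """Factorise non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H); every factor is non-negative, so H stays non-negative.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h

# Hypothetical data: four samples (rows) with three non-negative attributes each.
v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [1.0, 0.0, 1.0]]
w, h = nmf(v, rank=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

Each row of `w` is the low-dimensional, non-negative representation of one sample, and each row of `h` is an additive "part". That additivity is the property the abstract contrasts with PCA, whose components can carry mixed signs.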
Pub Date: 2024-11-07
DOI: 10.1016/j.imavis.2024.105331
Jingxin Lin, Kaifan Zhong, Tao Gong, Xianmin Zhang, Nianfeng Wang
This paper proposes neural network architectures for point cloud segmentation that leverage prior knowledge derived from same-type point clouds. The approach processes two point clouds concurrently: a target point cloud to be segmented and a labeled point cloud of the same type. The labeled point cloud provides preliminary labeling information, assisting in segmenting the target point cloud. A feature combination module is proposed to identify and combine corresponding features across the point clouds. The module augments the feature representation of the target cloud and improves its capacity for object discrimination. Experiments on the ShapeNetPart and S3DIS datasets demonstrate that, when integrated into classical network architectures, the proposed approach improves segmentation performance over the corresponding networks, with significant gains for some of them.
Citation: Jingxin Lin, Kaifan Zhong, Tao Gong, Xianmin Zhang, Nianfeng Wang. "Point cloud segmentation neural network with same-type point cloud assistance." Image and Vision Computing, vol. 152, Article 105331, published 2024-11-07. DOI: 10.1016/j.imavis.2024.105331
Pub Date: 2024-11-06
DOI: 10.1016/j.imavis.2024.105330
Lei Lei, Xianxian Li
Recently, impressive progress has been made with transformer-based RGB-T trackers due to the transformer’s effectiveness in capturing low-frequency information (i.e., high-level semantic information). However, some studies have revealed that the transformer exhibits limitations in capturing high-frequency information (i.e., low-level texture and edge details), thereby restricting the tracker’s capacity to precisely match target details within the search area. To address this issue, we propose a frequency hybrid awareness modeling RGB-T tracker, abbreviated as FHAT. Specifically, FHAT combines the advantages of convolution and max pooling in capturing high-frequency information within the transformer architecture. In this way, it strengthens the high-frequency features and enhances the model’s perception of detailed information. Additionally, to enhance the complementary effect between the two modalities, the tracker uses low-frequency information from both modalities for modality interaction, which avoids interaction errors caused by inconsistent local details across the modalities. Furthermore, the high-frequency information and the interacted low-frequency information are then fused, allowing the model to adaptively enhance the frequency features of the modal representation. Through extensive experiments on two mainstream RGB-T tracking benchmarks, our method demonstrates competitive performance.
Citation: Lei Lei, Xianxian Li. "RGB-T tracking with frequency hybrid awareness." Image and Vision Computing, vol. 152, Article 105330, published 2024-11-06. DOI: 10.1016/j.imavis.2024.105330
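The abstract above associates max pooling with high-frequency detail. A toy 1-D comparison shows the intuition: max pooling keeps the full strength of a sharp spike, while average pooling (a simple low-pass operation) dilutes it. The signal, window size, and function names are hypothetical, and this is not FHAT's architecture.

```python
def pool1d(signal, window, reduce):
    """Non-overlapping 1-D pooling with a pluggable reduction (e.g. max or mean)."""
    return [reduce(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]

def mean(values):
    return sum(values) / len(values)

# A flat signal with one sharp spike, standing in for an edge or texture response.
signal = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0]

max_pooled = pool1d(signal, 2, max)   # spike survives at full strength
avg_pooled = pool1d(signal, 2, mean)  # spike is halved by averaging

print(max_pooled)  # [0.0, 8.0, 0.0]
print(avg_pooled)  # [0.0, 4.0, 0.0]
```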
Pub Date: 2024-11-05
DOI: 10.1016/j.imavis.2024.105310
Rui Sun, Guoxi Huang, Xuebin Wang, Yun Du, Xudong Zhang
Visible-infrared person re-identification holds significant implications for intelligent security. Unsupervised methods can reduce the gap between different modalities without labels. Most previous unsupervised methods train their models with image information only, so the models cannot obtain powerful deep semantic information. In this paper, we leverage CLIP to extract deep text information. We propose a Text–Image Alignment (TIA) module to align the image and text information and effectively bridge the gap between the visible and infrared modalities. We introduce a Local–Global Image Match (LGIM) module to find homogeneous information. Specifically, we employ the Hungarian algorithm and the Simulated Annealing (SA) algorithm to obtain original information from image features while mitigating the interference of heterogeneous information. Additionally, we design a Changeable Cross-modality Alignment Loss (CCAL) to enable the model to learn modality-specific features during different training stages. Our method performs well and attains strong robustness through targeted learning. Extensive experiments demonstrate the effectiveness of our approach: our method achieves a rank-1 accuracy that exceeds state-of-the-art approaches by approximately 10% on the RegDB dataset.
Citation: Rui Sun, Guoxi Huang, Xuebin Wang, Yun Du, Xudong Zhang. "Text-augmented Multi-Modality contrastive learning for unsupervised visible-infrared person re-identification." Image and Vision Computing, vol. 152, Article 105310, published 2024-11-05. DOI: 10.1016/j.imavis.2024.105310
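The Hungarian algorithm named in the abstract above solves the one-to-one assignment problem: pair each row item with a distinct column item at minimum total cost. The sketch below solves the same problem by exhaustive search over permutations, swapped in for the Hungarian algorithm to keep the example short and dependency-free. It is only practical for tiny inputs (SciPy's `scipy.optimize.linear_sum_assignment` is a common production choice). The cost matrix and its interpretation as visible-to-infrared cluster distances are assumptions, and the abstract does not say how the authors build their costs.

```python
from itertools import permutations

def best_assignment(cost):
    """Return (columns, total): columns[i] is the column matched to row i at minimum total cost."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda cols: sum(cost[i][cols[i]] for i in range(n)))
    return list(best), sum(cost[i][best[i]] for i in range(n))

# Hypothetical distances between three visible clusters (rows) and three infrared clusters (columns).
cost = [[9, 1, 8],
        [2, 7, 9],
        [8, 9, 3]]

columns, total = best_assignment(cost)
print(columns, total)  # [1, 0, 2] 6
```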
Pub Date: 2024-11-05
DOI: 10.1016/j.imavis.2024.105309
Jiaqi Zhao, Ao Fu, Yong Zhou, Wen-liang Du, Rui Yao
Text-based person search aims to retrieve images of a person that are highly semantically relevant to a given textual description. The difficulties of this retrieval task are modality heterogeneity and fine-grained matching. Most existing methods only consider alignment using global features, ignoring the fine-grained matching problem. Cross-modal attention is widely used to align image patches and text tokens directly. However, cross-modal attention may cause a huge overhead in the inference stage and cannot be applied in real-world scenarios. In addition, patch-token alignment is unreasonable, since individual image patches and text tokens do not carry complete semantic information. This paper proposes an Embedding Set Alignment (ESA) module for fine-grained alignment. The module preserves fine-grained semantic information by merging token-level features into embedding sets. The ESA module benefits from pre-trained cross-modal large models, and it can be combined with the backbone non-intrusively and trained in an end-to-end manner. In addition, an Adaptive Semantic Margin (ASM) loss is designed to describe the alignment of embedding sets, instead of adopting a loss function with a fixed margin. Extensive experiments demonstrate that our proposed fine-grained semantic embedding set alignment method achieves state-of-the-art performance on three popular benchmark datasets, surpassing the previous best methods.
Citation: Jiaqi Zhao, Ao Fu, Yong Zhou, Wen-liang Du, Rui Yao. "Fine-grained semantic oriented embedding set alignment for text-based person search." Image and Vision Computing, vol. 152, Article 105309, published 2024-11-05. DOI: 10.1016/j.imavis.2024.105309
Pub Date: 2024-11-04
DOI: 10.1016/j.imavis.2024.105318
Dexin Ren, Minxian Li, Shidong Wang, Mingwu Ren, Haofeng Zhang
Unsupervised cross-domain road scene segmentation has attracted substantial interest because of its capability to perform segmentation on new and unlabeled domains, thereby reducing the dependence on expensive manual annotations. This is achieved by leveraging networks trained on labeled source domains to classify images on unlabeled target domains. Conventional techniques usually use adversarial networks to align inputs from the source and the target in either of their domains. However, these approaches often fall short in effectively integrating information from both domains, because alignment in either space alone usually leads to bias during feature learning. To overcome these limitations and enhance cross-domain interaction while mitigating overfitting to the source domain, we introduce a novel framework called Semantic-Aware Feature Enhancement Network (SAFENet) for Unsupervised Cross-domain Road Scene Segmentation. SAFENet incorporates the Semantic-Aware Enhancement (SAE) module to amplify the importance of class information in segmentation tasks and uses the semantic space as a new domain to guide the alignment of the source and target domains. Additionally, we integrate Adaptive Instance Normalization with Momentum (AdaIN-M) techniques, which convert the source domain image style to the target domain image style, thereby reducing the adverse effects of source domain overfitting on target domain segmentation performance. Moreover, SAFENet employs a Knowledge Transfer (KT) module to optimize network architecture, enhancing computational efficiency during testing while maintaining the robust inference capabilities developed during training. To further improve segmentation performance, we employ Curriculum Learning, a self-training mechanism that uses pseudo-labels derived from the target domain to iteratively refine the network. Comprehensive experiments on three well-known datasets, in the “Synthia→Cityscapes” and “GTA5→Cityscapes” settings, demonstrate the superior performance of our method. In-depth examinations and ablation studies verify the efficacy of each module within the proposed method.
Citation: Dexin Ren, Minxian Li, Shidong Wang, Mingwu Ren, Haofeng Zhang. "SAFENet: Semantic-Aware Feature Enhancement Network for unsupervised cross-domain road scene segmentation." Image and Vision Computing, vol. 152, Article 105318, published 2024-11-04. DOI: 10.1016/j.imavis.2024.105318
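The style conversion mentioned in the abstract above builds on adaptive instance normalization (AdaIN), which re-normalizes each source channel so that its mean and standard deviation match those of the corresponding target channel. The sketch below shows plain AdaIN on flat, single-channel toy data. The momentum component of AdaIN-M is not described in the abstract and is therefore omitted, and the function names, `eps` value, and sample activations are assumptions.

```python
import math

def channel_stats(values, eps=1e-5):
    """Mean and eps-stabilised standard deviation of one feature channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)

def adain(source, target, eps=1e-5):
    """Re-normalise each source channel to the matching target channel's statistics.

    source, target: lists of channels, each channel a flat list of activations.
    """
    styled = []
    for src, tgt in zip(source, target):
        s_mean, s_std = channel_stats(src, eps)
        t_mean, t_std = channel_stats(tgt, eps)
        # Whiten with the source statistics, then re-colour with the target statistics.
        styled.append([(v - s_mean) / s_std * t_std + t_mean for v in src])
    return styled

# Hypothetical single-channel activations from a source-domain and a target-domain image.
source = [[1.0, 2.0, 3.0, 4.0]]
target = [[10.0, 20.0, 30.0, 40.0]]

styled = adain(source, target)
print(channel_stats(styled[0], eps=0.0))  # close to the target's mean and standard deviation
```

The spatial ordering of the source activations is preserved. Only the per-channel statistics, which are commonly treated as "style", are replaced.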
Pub Date: 2024-11-04
DOI: 10.1016/j.imavis.2024.105308
Habib Khan, Muhammad Talha Usman, Imad Rida, JaKeoung Koo
Salient object detection (SOD) enables machines to recognize and accurately segment visually prominent regions in images. Despite recent advances, existing approaches often lack progressive fusion of low- and high-level features, effective multi-scale feature handling, and precise boundary detection. Moreover, the robustness of these models under varied lighting conditions remains a concern. To overcome these challenges, we present the Attention Enhanced Machine Instinctive Vision framework for SOD. The proposed framework follows the strategy of a Multi-stage Feature Refinement with Optimal Attentions-Driven Framework (MFRNet). Multi-level features are extracted from six stages of the EfficientNet-B7 backbone, which enables effective fusion of low- and high-level details across scales in the later stages of the framework. We introduce the Spatial-optimized Feature Attention (SOFA) module, which refines spatial features from the three initial-stage feature maps. The multi-scale features extracted from the backbone are passed through convolutional feature transformation and spatial attention mechanisms to refine the low-level information. The SOFA module then concatenates and upsamples these refined features, producing a comprehensive spatial representation across levels. Moreover, the proposed Context-Aware Channel Refinement (CACR) module integrates dilated convolutions with optimized dilation rates, followed by channel attention, to capture multi-scale contextual information from the three deeper layers. Furthermore, our progressive feature fusion strategy combines high-level semantic information and low-level spatial details through multiple residual connections, ensuring robust feature representation and effective gradient backpropagation. To enhance robustness, we train our network with augmented data featuring low and high brightness adjustments, improving its ability to handle diverse lighting conditions.
Extensive experiments on four benchmark datasets (ECSSD, HKU-IS, DUTS, and PASCAL-S) validate the proposed framework's effectiveness, demonstrating superior performance compared to existing state-of-the-art (SOTA) methods in the domain. Code, qualitative results, and trained weights will be available at https://github.com/habib1402/MFRNet-SOD.
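The brightness augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `adjust_brightness` and `brightness_augment`, the scaling ranges, and the use of a plain multiplicative factor on 8-bit grayscale pixels are all assumptions made for illustration. The abstract states only that low and high brightness adjustments are used during training.

```python
import random


def adjust_brightness(image, factor):
    """Scale each 8-bit pixel by `factor` and clamp the result to [0, 255].

    `image` is a list of rows of integer pixel values (hypothetical format).
    """
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def brightness_augment(image, rng, low=(0.4, 0.8), high=(1.2, 1.6)):
    """Return one darker and one brighter copy of `image`.

    The `low` and `high` factor ranges are illustrative assumptions,
    not values reported in the abstract.
    """
    dark = adjust_brightness(image, rng.uniform(*low))
    bright = adjust_brightness(image, rng.uniform(*high))
    return dark, bright


# Tiny example image; a fixed seed keeps the sketch reproducible.
image = [[0, 64, 128], [192, 255, 32]]
dark, bright = brightness_augment(image, random.Random(0))
```

Because a brightness change is purely photometric, the ground-truth saliency mask of a training pair would stay unchanged; only the input image is altered.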
Title: Attention enhanced machine instinctive vision with human-inspired saliency detection. Journal: Image and Vision Computing, vol. 152, Article 105308. Pub Date: 2024-11-04. DOI: 10.1016/j.imavis.2024.105308
Pub Date: 2024-11-01. DOI: 10.1016/j.imavis.2024.105305
Feng Hao, Fujin Zhong, Yunhe Wang, Hong Yu, Jun Hu, Yan Yang
Title: Corrigendum to “STAFFormer: Spatio-temporal adaptive fusion transformer for efficient 3D human pose estimation” [Journal of Image and Vision Computing volume 149 (2024) 105142]. Journal: Image and Vision Computing, vol. 151, Article 105305.