Image and Vision Computing最新文献

Rethinking the sample relations for few-shot classification

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-22 DOI: 10.1016/j.imavis.2025.105550

Guowei Yin , Sheng Huang , Luwen Huangfu , Yi Zhang , Xiaohong Zhang

Feature quality is paramount for classification performance, particularly in few-shot scenarios. Contrastive learning, a widely adopted technique for enhancing feature quality, leverages sample relations to extract intrinsic features that capture semantic information and has achieved remarkable success in Few-Shot Learning (FSL). Nevertheless, current few-shot contrastive learning approaches often overlook the semantic similarity discrepancies at different granularities when employing the same modeling approach for different sample relations, which limits the potential of few-shot contrastive learning. In this paper, we introduce a straightforward yet effective contrastive learning approach, Multi-Grained Relation Contrastive Learning (MGRCL), as a pre-training feature learning model to boost few-shot learning by meticulously modeling sample relations at different granularities. MGRCL categorizes sample relations into three types: intra-sample relation of the same sample under different transformations, intra-class relation of homogeneous samples, and inter-class relation of inhomogeneous samples. In MGRCL, we design Transformation Consistency Learning (TCL) to ensure the rigorous semantic consistency of a sample under different transformations by aligning predictions of input pairs. Furthermore, to preserve discriminative information, we employ Class Contrastive Learning (CCL) to ensure that a sample is always closer to its homogeneous samples than its inhomogeneous ones, as homogeneous samples share similar semantic content while inhomogeneous samples have different semantic content. Our method is assessed across four popular FSL benchmarks, showing that such a simple pre-training feature learning method surpasses a majority of leading FSL methods. Moreover, our method can be incorporated into other FSL methods as the pre-trained model and help them obtain significant performance gains.

{"title":"Rethinking the sample relations for few-shot classification","authors":"Guowei Yin , Sheng Huang , Luwen Huangfu , Yi Zhang , Xiaohong Zhang","doi":"10.1016/j.imavis.2025.105550","DOIUrl":"10.1016/j.imavis.2025.105550","url":null,"abstract":"<div><div>Feature quality is paramount for classification performance, particularly in few-shot scenarios. Contrastive learning, a widely adopted technique for enhancing feature quality, leverages sample relations to extract intrinsic features that capture semantic information and has achieved remarkable success in Few-Shot Learning (FSL). Nevertheless, current few-shot contrastive learning approaches often overlook the semantic similarity discrepancies at different granularities when employing the same modeling approach for different sample relations, which limits the potential of few-shot contrastive learning. In this paper, we introduce a straightforward yet effective contrastive learning approach, Multi-Grained Relation Contrastive Learning (MGRCL), as a pre-training feature learning model to boost few-shot learning by meticulously modeling sample relations at different granularities. MGRCL categorizes sample relations into three types: intra-sample relation of the same sample under different transformations, intra-class relation of homogeneous samples, and inter-class relation of inhomogeneous samples. In MGRCL, we design Transformation Consistency Learning (TCL) to ensure the rigorous semantic consistency of a sample under different transformations by aligning predictions of input pairs. Furthermore, to preserve discriminative information, we employ Class Contrastive Learning (CCL) to ensure that a sample is always closer to its homogeneous samples than its inhomogeneous ones, as homogeneous samples share similar semantic content while inhomogeneous samples have different semantic content. Our method is assessed across four popular FSL benchmarks, showing that such a simple pre-training feature learning method surpasses a majority of leading FSL methods. Moreover, our method can be incorporated into other FSL methods as the pre-trained model and help them obtain significant performance gains.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105550"},"PeriodicalIF":4.2,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143864625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

W-Net: A facial feature-guided face super-resolution network

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-17 DOI: 10.1016/j.imavis.2025.105549

Hao Liu , Yang Yang , Yunxia Liu

Face Super-Resolution (FSR) aims to recover high-resolution (HR) face images from low-resolution (LR) ones. Despite the progress made by convolutional neural networks in FSR, the results of existing approaches are not ideal due to their low reconstruction efficiency and insufficient utilization of prior information. Considering that faces are highly structured objects, effectively leveraging facial priors to improve FSR results is a worthwhile endeavor. This paper proposes a novel network architecture called W-Net to address this challenge. W-Net leverages a meticulously designed Parsing Block to fully exploit the resolution potential of LR image. We use this parsing map as an attention prior, effectively integrating information from both the parsing map and LR images. Simultaneously, we perform multiple fusions across different latent representation dimensions through the W-shaped network structure combined with the LPF(LR-Parsing Map Fusion Module). Additionally, we utilize a facial parsing graph as a mask, assigning different weights and loss functions to key facial areas to balance the performance of our reconstructed facial images between perceptual quality and pixel accuracy. We conducted extensive comparative experiments, not only limited to conventional facial super-resolution metrics but also extending to downstream tasks such as facial recognition and facial keypoint detection. The experiments demonstrate that W-Net exhibits outstanding performance in quantitative metrics, visual quality, and downstream tasks.

{"title":"W-Net: A facial feature-guided face super-resolution network","authors":"Hao Liu , Yang Yang , Yunxia Liu","doi":"10.1016/j.imavis.2025.105549","DOIUrl":"10.1016/j.imavis.2025.105549","url":null,"abstract":"<div><div>Face Super-Resolution (FSR) aims to recover high-resolution (HR) face images from low-resolution (LR) ones. Despite the progress made by convolutional neural networks in FSR, the results of existing approaches are not ideal due to their low reconstruction efficiency and insufficient utilization of prior information. Considering that faces are highly structured objects, effectively leveraging facial priors to improve FSR results is a worthwhile endeavor. This paper proposes a novel network architecture called W-Net to address this challenge. W-Net leverages a meticulously designed Parsing Block to fully exploit the resolution potential of LR image. We use this parsing map as an attention prior, effectively integrating information from both the parsing map and LR images. Simultaneously, we perform multiple fusions across different latent representation dimensions through the W-shaped network structure combined with the LPF(<strong>L</strong>R-<strong>P</strong>arsing Map <strong>F</strong>usion Module). Additionally, we utilize a facial parsing graph as a mask, assigning different weights and loss functions to key facial areas to balance the performance of our reconstructed facial images between perceptual quality and pixel accuracy. We conducted extensive comparative experiments, not only limited to conventional facial super-resolution metrics but also extending to downstream tasks such as facial recognition and facial keypoint detection. The experiments demonstrate that W-Net exhibits outstanding performance in quantitative metrics, visual quality, and downstream tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105549"},"PeriodicalIF":4.2,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143864623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

GCESS: A two-phase generative learning framework for estimate molecular expression to cell detection and analysis

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-15 DOI: 10.1016/j.imavis.2025.105554

Tianwang Xun , Lei Su , Wenting Shang , Di Dong , Lizhi Shao

Whole slide image (WSI) plays an important role in cancer research. Cell recognition is the foundation and key steps of WSI analysis at the cellular level, including cell segmentation, subtypes detection and molecular expression prediction at the cellular level. Current end-to-end supervised learning models rely heavily on a large amount of manually labeled data and self-supervised learning models are limited to cell binary segmentation. All of these methods lack the ability to predict the expression level of molecules in single cells. In this study, we proposed a two-phase generative adversarial learning framework, named GCESS, which can achieve end-to-end cell binary segmentation, subtypes detection and molecular expression prediction simultaneously. The framework uses generative adversarial learning to obtain better cell binary segmentation results in the first phase by integrating the cell binary segmentation results of some segmentation models and generates multiplex immunohistochemistry (mIHC) images through generative adversarial networks to predict the expression of cell molecules in the second phase. The cell semantic segmentation results can be obtained by spatially mapping the binary segmentation and molecular expression results in pixel level. The method we proposed achieves a Dice of 0.865 on cell binary segmentation, an accuracy of 0.917 on cell semantic segmentation and a Peak Signal to Noise Ratio (PSNR) of 20.929 dB on mIHC images generating, outperforming other competing methods (P-value < 0.05). The method we proposed will provide an effective tool for cellular level analysis of digital pathology images and cancer research.

{"title":"GCESS: A two-phase generative learning framework for estimate molecular expression to cell detection and analysis","authors":"Tianwang Xun , Lei Su , Wenting Shang , Di Dong , Lizhi Shao","doi":"10.1016/j.imavis.2025.105554","DOIUrl":"10.1016/j.imavis.2025.105554","url":null,"abstract":"<div><div>Whole slide image (WSI) plays an important role in cancer research. Cell recognition is the foundation and key steps of WSI analysis at the cellular level, including cell segmentation, subtypes detection and molecular expression prediction at the cellular level. Current end-to-end supervised learning models rely heavily on a large amount of manually labeled data and self-supervised learning models are limited to cell binary segmentation. All of these methods lack the ability to predict the expression level of molecules in single cells. In this study, we proposed a two-phase generative adversarial learning framework, named GCESS, which can achieve end-to-end cell binary segmentation, subtypes detection and molecular expression prediction simultaneously. The framework uses generative adversarial learning to obtain better cell binary segmentation results in the first phase by integrating the cell binary segmentation results of some segmentation models and generates multiplex immunohistochemistry (mIHC) images through generative adversarial networks to predict the expression of cell molecules in the second phase. The cell semantic segmentation results can be obtained by spatially mapping the binary segmentation and molecular expression results in pixel level. The method we proposed achieves a Dice of 0.865 on cell binary segmentation, an accuracy of 0.917 on cell semantic segmentation and a Peak Signal to Noise Ratio (PSNR) of 20.929 dB on mIHC images generating, outperforming other competing methods (P-value <0.05). The method we proposed will provide an effective tool for cellular level analysis of digital pathology images and cancer research.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105554"},"PeriodicalIF":4.2,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143858882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Bridging efficiency and interpretability: Explainable AI for multi-classification of pulmonary diseases utilizing modified lightweight CNNs

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-15 DOI: 10.1016/j.imavis.2025.105553

Samia Khan, Farheen Siddiqui, Mohd Abdul Ahad

Pulmonary diseases are notable global health challenges that contribute to increased morbidity and mortality rates. Early and accurate diagnosis is essential for effective treatment. However, traditional apprehension of chest X-ray images is tiresome and susceptible to human error, particularly in resource-constrained settings. Current progress in deep learning, particularly convolutional neural networks, has enabled the automated classification of pulmonary diseases with increased accuracy. In this study, we have proposed an explainable AI approach using modified lightweight convolution neural networks, such as MobileNetV2, EfficientNet-B0, NASNetMobile, and ResNet50V2 to achieve efficient and interpretable classification of multiple pulmonary diseases. Lightweight CNNs are designed to minimize computational complexity while maintaining robust performance, making them ideal for mobile and embedded systems with limited processing power deployment. Our models demonstrated strong performance in detecting pulmonary diseases, with EfficientNet-B0 achieving an accuracy of 94.07%, precision of 94.16%, recall of 94.07%, and F1 score of 94.04%. Furthermore, we have incorporated explainability methods (grad-CAM & t-SNE) to enhance the transparency of model predictions, providing clinicians with a trustworthy tool for diagnostic decision support. The results suggest that lightweight CNNs effectively balance accuracy, efficiency, and interpretability, making them suitable for real-time pulmonary disease detection in clinical and low-resource environments

{"title":"Bridging efficiency and interpretability: Explainable AI for multi-classification of pulmonary diseases utilizing modified lightweight CNNs","authors":"Samia Khan, Farheen Siddiqui, Mohd Abdul Ahad","doi":"10.1016/j.imavis.2025.105553","DOIUrl":"10.1016/j.imavis.2025.105553","url":null,"abstract":"<div><div>Pulmonary diseases are notable global health challenges that contribute to increased morbidity and mortality rates. Early and accurate diagnosis is essential for effective treatment. However, traditional apprehension of chest X-ray images is tiresome and susceptible to human error, particularly in resource-constrained settings. Current progress in deep learning, particularly convolutional neural networks, has enabled the automated classification of pulmonary diseases with increased accuracy. In this study, we have proposed an explainable AI approach using modified lightweight convolution neural networks, such as MobileNetV2, EfficientNet-B0, NASNetMobile, and ResNet50V2 to achieve efficient and interpretable classification of multiple pulmonary diseases. Lightweight CNNs are designed to minimize computational complexity while maintaining robust performance, making them ideal for mobile and embedded systems with limited processing power deployment. Our models demonstrated strong performance in detecting pulmonary diseases, with EfficientNet-B0 achieving an accuracy of 94.07%, precision of 94.16%, recall of 94.07%, and F1 score of 94.04%. Furthermore, we have incorporated explainability methods (grad-CAM & t-SNE) to enhance the transparency of model predictions, providing clinicians with a trustworthy tool for diagnostic decision support. The results suggest that lightweight CNNs effectively balance accuracy, efficiency, and interpretability, making them suitable for real-time pulmonary disease detection in clinical and low-resource environments</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105553"},"PeriodicalIF":4.2,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143843875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Dynamic feature extraction and histopathology domain shift alignment for mitosis detection

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-14 DOI: 10.1016/j.imavis.2025.105541

Jiangxiao Han, Shikang Wang, Lianjun Wu, Wenyu Liu

Mitosis count is of crucial significance in cancer diagnosis; therefore, mitosis detection is a meaningful subject in medical image studies. The challenge of mitosis detection lies in the intra-class variance of mitosis and hard negatives, i.e., the sizes/ shapes of mitotic cells vary considerably and plenty of non-mitotic cells resemble mitosis, and the histopathology domain shift across datasets caused by different tissues and organs, scanners, labs, etc. In this paper, we propose a novel Domain Generalized Dynamic Mitosis Detector (DGDMD) to handle the intra-class variance and histopathology domain shift of mitosis detection with a dynamic mitosis feature extractor based on residual structured depth-wise convolution and domain shift alignment terms. The proposed dynamic mitosis feature extractor handles the intra-class variance caused by different sizes and shapes of mitotic cells as well as non-mitotic hard negatives. The proposed domain generalization schedule implemented via novel histopathology-mitosis domain shift alignments deals with the domain shift between histopathology slides in training and test datasets from different sources. We validate the domain generalization ability for mitosis detection of our algorithm on the MIDOG++ dataset and typical mitosis datasets, including the MIDOG 2021, ICPR MITOSIS 2014, AMIDA 2013, and TUPAC 16. Experimental results show that we achieve state-of-the-art (SOTA) performance on the MIDOG++ dataset for the domain generalization across tissue and organs of mitosis detection, across scanners on the MIDOG 2021 dataset, and across data sources on external datasets, demonstrating the effectiveness of our proposed method on the domain generalization of mitosis detection.

{"title":"Dynamic feature extraction and histopathology domain shift alignment for mitosis detection","authors":"Jiangxiao Han, Shikang Wang, Lianjun Wu, Wenyu Liu","doi":"10.1016/j.imavis.2025.105541","DOIUrl":"10.1016/j.imavis.2025.105541","url":null,"abstract":"<div><div>Mitosis count is of crucial significance in cancer diagnosis; therefore, mitosis detection is a meaningful subject in medical image studies. The challenge of mitosis detection lies in the intra-class variance of mitosis and hard negatives, i.e., the sizes/ shapes of mitotic cells vary considerably and plenty of non-mitotic cells resemble mitosis, and the histopathology domain shift across datasets caused by different tissues and organs, scanners, labs, etc. In this paper, we propose a novel Domain Generalized Dynamic Mitosis Detector (DGDMD) to handle the intra-class variance and histopathology domain shift of mitosis detection with a dynamic mitosis feature extractor based on residual structured depth-wise convolution and domain shift alignment terms. The proposed dynamic mitosis feature extractor handles the intra-class variance caused by different sizes and shapes of mitotic cells as well as non-mitotic hard negatives. The proposed domain generalization schedule implemented via novel histopathology-mitosis domain shift alignments deals with the domain shift between histopathology slides in training and test datasets from different sources. We validate the domain generalization ability for mitosis detection of our algorithm on the MIDOG++ dataset and typical mitosis datasets, including the MIDOG 2021, ICPR MITOSIS 2014, AMIDA 2013, and TUPAC 16. Experimental results show that we achieve state-of-the-art (SOTA) performance on the MIDOG++ dataset for the domain generalization across tissue and organs of mitosis detection, across scanners on the MIDOG 2021 dataset, and across data sources on external datasets, demonstrating the effectiveness of our proposed method on the domain generalization of mitosis detection.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105541"},"PeriodicalIF":4.2,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143843661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Automated dual CNN-based feature extraction with SMOTE for imbalanced diabetic retinopathy classification

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-12 DOI: 10.1016/j.imavis.2025.105537

Danyal Badar Soomro , Wang ChengLiang , Mahmood Ashraf , Dina Abdulaziz AlHammadi , Shtwai Alsubai , Carlo Medaglia , Nisreen Innab , Muhammad Umer

The primary cause of Diabetic Retinopathy (DR) is high blood sugar due to long-term diabetes. Early and correct diagnosis of the DR is essential for timely and effective treatment. Despite high performance of recently developed models, there is still a need to overcome the problem of class imbalance issues and feature extraction to achieve accurate results. To resolve this problem, we have presented an automated model combining the customized ResNet-50 and EfficientNetB0 for detecting and classifying DR in fundus images. The proposed model addresses class imbalance using data augmentation and Synthetic Minority Oversampling Technique (SMOTE) for oversampling the training data and enhances the feature extraction process through fine-tuned ResNet50 and EfficientNetB0 models with ReLU activations and global average pooling. Combining extracted features and then passing it to four different classifiers effectively captures both local and global spatial features, thereby improving classification accuracy for diabetic retinopathy. For Experiment, The APTOS 2019 Dataset is used, and it contains of 3662 high-quality fundus images. The performance of the proposed model is assessed using several metrics, and the findings are compared with contemporary methods for diabetic retinopathy detection. The suggested methodology demonstrates substantial enhancement in diabetic retinopathy diagnosis for fundus pictures. The proposed automated model attained an accuracy of 98.5% for binary classification and 92.73% for multiclass classification.

{"title":"Automated dual CNN-based feature extraction with SMOTE for imbalanced diabetic retinopathy classification","authors":"Danyal Badar Soomro , Wang ChengLiang , Mahmood Ashraf , Dina Abdulaziz AlHammadi , Shtwai Alsubai , Carlo Medaglia , Nisreen Innab , Muhammad Umer","doi":"10.1016/j.imavis.2025.105537","DOIUrl":"10.1016/j.imavis.2025.105537","url":null,"abstract":"<div><div>The primary cause of Diabetic Retinopathy (DR) is high blood sugar due to long-term diabetes. Early and correct diagnosis of the DR is essential for timely and effective treatment. Despite high performance of recently developed models, there is still a need to overcome the problem of class imbalance issues and feature extraction to achieve accurate results. To resolve this problem, we have presented an automated model combining the customized ResNet-50 and EfficientNetB0 for detecting and classifying DR in fundus images. The proposed model addresses class imbalance using data augmentation and Synthetic Minority Oversampling Technique (SMOTE) for oversampling the training data and enhances the feature extraction process through fine-tuned ResNet50 and EfficientNetB0 models with ReLU activations and global average pooling. Combining extracted features and then passing it to four different classifiers effectively captures both local and global spatial features, thereby improving classification accuracy for diabetic retinopathy. For Experiment, The APTOS 2019 Dataset is used, and it contains of 3662 high-quality fundus images. The performance of the proposed model is assessed using several metrics, and the findings are compared with contemporary methods for diabetic retinopathy detection. The suggested methodology demonstrates substantial enhancement in diabetic retinopathy diagnosis for fundus pictures. The proposed automated model attained an accuracy of 98.5% for binary classification and 92.73% for multiclass classification.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105537"},"PeriodicalIF":4.2,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143869522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

MFC-Net: Amodal instance segmentation with multi-path fusion and context-awareness

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-12 DOI: 10.1016/j.imavis.2025.105539

Yunfei Yang , Hongwei Deng , Yichun Wu

Amodal instance segmentation refers to sensing the entire instance in an image, thereby segmenting the visible parts of an object and the regions that may be masked. However, existing amodal instance segmentation methods predict rough mask edges and perform poorly in segmenting objects with significant size differences. In addition, the occlusion environment greatly limits the performance of the model. To address the above problems, this work proposes an amodal instance segmentation method called MFC-Net to accurately segment objects in an image. For the rough prediction of mask edges, the model introduces the multi-path transformer structure to obtain finer object semantic features and boundary information, which improves the accuracy of edge region segmentation. For the problem of poor segmentation of object instances with significant size differences, we design an adaptive feature fusion module AFF, which dynamically captures the scale changes related to object size and fuses the multi-scale semantic feature information, so that the model obtains a receptive field adapted to the object size. To address the poor performance of segmentation in the occlusion environment, we designed the context-aware mask segmentation module CMS in the prediction module to make a preliminary prediction of the object’s amodal region. The module enhances the amodal perception of the model by modeling the long-range dependencies of the objects and capturing the contextual information of the occluded part of the object. Compared with the state-of-the-art methods, the MFC-Net proposed in this paper achieves a mAP of 73.3% on the D2SA dataset and 33.9% and 36.9% on the KINS and COCOA-cls datasets, respectively. Moreover, MFC-Net can produce complete and detailed amodal masks.

模态实例分割是指感知图像中的整个实例，从而分割出物体的可见部分和可能被遮挡的区域。然而，现有的模态实例分割方法会预测粗糙的遮挡边缘，在分割大小差异明显的物体时表现不佳。此外，遮挡环境也极大地限制了模型的性能。针对上述问题，本研究提出了一种名为 MFC-Net 的模态实例分割方法，以准确分割图像中的物体。针对遮挡边缘的粗略预测，模型引入多路径变换器结构，获取更精细的物体语义特征和边界信息，提高了边缘区域分割的准确性。针对大小差异明显的物体实例分割效果不佳的问题，我们设计了自适应特征融合模块 AFF，它能动态捕捉与物体大小相关的尺度变化，并融合多尺度语义特征信息，从而使模型获得与物体大小相适应的感受野。针对遮挡环境下分割效果不佳的问题，我们在预测模块中设计了情境感知遮罩分割模块 CMS，对物体的模态区域进行初步预测。该模块通过对物体的长程依赖关系进行建模，并捕捉物体遮挡部分的上下文信息，从而增强了模型的模态感知能力。与最先进的方法相比，本文提出的 MFC-Net 在 D2SA 数据集上的 mAP 达到 73.3%，在 KINS 和 COCOA-cls 数据集上的 mAP 分别达到 33.9% 和 36.9%。此外，MFC-Net 还能生成完整而详细的模态掩码。

{"title":"MFC-Net: Amodal instance segmentation with multi-path fusion and context-awareness","authors":"Yunfei Yang , Hongwei Deng , Yichun Wu","doi":"10.1016/j.imavis.2025.105539","DOIUrl":"10.1016/j.imavis.2025.105539","url":null,"abstract":"<div><div>Amodal instance segmentation refers to sensing the entire instance in an image, thereby segmenting the visible parts of an object and the regions that may be masked. However, existing amodal instance segmentation methods predict rough mask edges and perform poorly in segmenting objects with significant size differences. In addition, the occlusion environment greatly limits the performance of the model. To address the above problems, this work proposes an amodal instance segmentation method called MFC-Net to accurately segment objects in an image. For the rough prediction of mask edges, the model introduces the multi-path transformer structure to obtain finer object semantic features and boundary information, which improves the accuracy of edge region segmentation. For the problem of poor segmentation of object instances with significant size differences, we design an adaptive feature fusion module AFF, which dynamically captures the scale changes related to object size and fuses the multi-scale semantic feature information, so that the model obtains a receptive field adapted to the object size. To address the poor performance of segmentation in the occlusion environment, we designed the context-aware mask segmentation module CMS in the prediction module to make a preliminary prediction of the object’s amodal region. The module enhances the amodal perception of the model by modeling the long-range dependencies of the objects and capturing the contextual information of the occluded part of the object. Compared with the state-of-the-art methods, the MFC-Net proposed in this paper achieves a mAP of 73.3% on the D2SA dataset and 33.9% and 36.9% on the KINS and COCOA-cls datasets, respectively. Moreover, MFC-Net can produce complete and detailed amodal masks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105539"},"PeriodicalIF":4.2,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143828331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Hybrid Attention Transformers with fast Fourier convolution for light field image super-resolution

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-10 DOI: 10.1016/j.imavis.2025.105542

Zhicheng Ma , Yuduo Guo , Zhaoxiang Liu , Shiguo Lian , Sen Wan

The limited spatial resolution of light field (LF) cameras has hindered their widespread adoption, emphasizing the critical need for superresolution techniques to improve their practical use. Transformer-based methods, such as LF-DET, have shown potential in enhancing light field spatial super-resolution (LF-SR). However, LF-DET, which employs a spatial-angular separable transformer encoder with sub-sampling spatial and multiscale angular modeling for global context interaction, struggles to effectively capture global context in early layers and local details. In this work, we introduce LF-HATF, a novel network that builds on the LF-DET framework and incorporates Fast Fourier Convolution (FFC) and Hybrid Attention Transformers (HATs) to address these limitations. This integration enables LF-HATF to better capture both global and local information, significantly improving the restoration of edge details and textures, and providing a more comprehensive understanding of complex scenes. Additionally, we propose the Light Field Charbonnier loss function, designed to balance differential distributions across various LF views. This function minimizes errors both within the same perspective and across different views, further enhancing the model’s performance. Our evaluation on five public LF datasets demonstrates that LF-HATF outperforms existing methods, representing a significant advancement in LF-SR technology. This progress pushes the field forward and opens new avenues for research in light field imaging, unlocking the full potential of light field cameras.

光场（LF）照相机的空间分辨率有限，阻碍了其广泛应用，因此迫切需要超分辨率技术来改善其实际应用。基于变压器的方法（如 LF-DET）已显示出提高光场空间超分辨率（LF-SR）的潜力。然而，LF-DET 采用空间-矩形可分离变压器编码器，并通过子采样空间和多尺度角度建模实现全局上下文交互，却难以有效捕捉早期层的全局上下文和局部细节。在这项工作中，我们引入了 LF-HATF，这是一种基于 LF-DET 框架并结合了快速傅立叶卷积（FFC）和混合注意力变换器（HAT）的新型网络，以解决这些局限性。这种整合使 LF-HATF 能够更好地捕捉全局和局部信息，显著改善边缘细节和纹理的还原，并提供对复杂场景更全面的理解。此外，我们还提出了光场 Charbonnier 损失函数，旨在平衡不同 LF 视图之间的差异分布。该函数最大限度地减少了同一视角和不同视角之间的误差，进一步提高了模型的性能。我们在五个公共 LF 数据集上进行的评估表明，LF-HATF 的性能优于现有方法，代表了 LF-SR 技术的重大进步。这一进步推动了光场成像领域的发展，为光场成像研究开辟了新的途径，释放了光场相机的全部潜能。

{"title":"Hybrid Attention Transformers with fast Fourier convolution for light field image super-resolution","authors":"Zhicheng Ma , Yuduo Guo , Zhaoxiang Liu , Shiguo Lian , Sen Wan","doi":"10.1016/j.imavis.2025.105542","DOIUrl":"10.1016/j.imavis.2025.105542","url":null,"abstract":"<div><div>The limited spatial resolution of light field (LF) cameras has hindered their widespread adoption, emphasizing the critical need for superresolution techniques to improve their practical use. Transformer-based methods, such as LF-DET, have shown potential in enhancing light field spatial super-resolution (LF-SR). However, LF-DET, which employs a spatial-angular separable transformer encoder with sub-sampling spatial and multiscale angular modeling for global context interaction, struggles to effectively capture global context in early layers and local details. In this work, we introduce LF-HATF, a novel network that builds on the LF-DET framework and incorporates Fast Fourier Convolution (FFC) and Hybrid Attention Transformers (HATs) to address these limitations. This integration enables LF-HATF to better capture both global and local information, significantly improving the restoration of edge details and textures, and providing a more comprehensive understanding of complex scenes. Additionally, we propose the Light Field Charbonnier loss function, designed to balance differential distributions across various LF views. This function minimizes errors both within the same perspective and across different views, further enhancing the model’s performance. Our evaluation on five public LF datasets demonstrates that LF-HATF outperforms existing methods, representing a significant advancement in LF-SR technology. This progress pushes the field forward and opens new avenues for research in light field imaging, unlocking the full potential of light field cameras.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105542"},"PeriodicalIF":4.2,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143828330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Controlling vision-language model for enhancing image restoration

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-08 DOI: 10.1016/j.imavis.2025.105538

Mingwen Shao, Weihan Liu, Qiwang Li, Lingzhuang Meng, Yecong Wan

Restoring low-quality images to their original high-quality state remains a significant challenge due to inherent uncertainties, particularly in blind image restoration scenarios where the nature of degradation is unknown. Despite recent advances, many restoration techniques still grapple with robustness and adaptability across diverse degradation conditions. In this paper, we introduce an approach to augment the restoration model by exploiting the robust prior features of CLIP, a large-scale vision-language model, to enhance its proficiency in handling a broader spectrum of degradation tasks. We integrate the robust priors from CLIP into the pre-trained image restoration model via cross-attention mechanisms, and we design a Prior Adapter to modulate these features, thereby enhancing the model’s restoration performance. Additionally, we introduce an innovative prompt learning framework that harnesses CLIP’s multimodal alignment capabilities to fine-tune pre-trained restoration models. Furthermore, we utilize CLIP’s contrastive loss to ensure that the restored images align more closely with the prompts of clean images in CLIP’s latent space, thereby improving the quality of the restoration. Through comprehensive experiments, we demonstrate the effectiveness and robustness of our method, showcasing its superior adaptability to a wide array of degradation tasks. Our findings emphasize the potential of integrating vision-language models such as CLIP to advance the cutting-edge in image restoration.

{"title":"Controlling vision-language model for enhancing image restoration","authors":"Mingwen Shao, Weihan Liu, Qiwang Li, Lingzhuang Meng, Yecong Wan","doi":"10.1016/j.imavis.2025.105538","DOIUrl":"10.1016/j.imavis.2025.105538","url":null,"abstract":"<div><div>Restoring low-quality images to their original high-quality state remains a significant challenge due to inherent uncertainties, particularly in blind image restoration scenarios where the nature of degradation is unknown. Despite recent advances, many restoration techniques still grapple with robustness and adaptability across diverse degradation conditions. In this paper, we introduce an approach to augment the restoration model by exploiting the robust prior features of CLIP, a large-scale vision-language model, to enhance its proficiency in handling a broader spectrum of degradation tasks. We integrate the robust priors from CLIP into the pre-trained image restoration model via cross-attention mechanisms, and we design a Prior Adapter to modulate these features, thereby enhancing the model’s restoration performance. Additionally, we introduce an innovative prompt learning framework that harnesses CLIP’s multimodal alignment capabilities to fine-tune pre-trained restoration models. Furthermore, we utilize CLIP’s contrastive loss to ensure that the restored images align more closely with the prompts of clean images in CLIP’s latent space, thereby improving the quality of the restoration. Through comprehensive experiments, we demonstrate the effectiveness and robustness of our method, showcasing its superior adaptability to a wide array of degradation tasks. Our findings emphasize the potential of integrating vision-language models such as CLIP to advance the cutting-edge in image restoration.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105538"},"PeriodicalIF":4.2,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143839408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Wave-based cross-phase representation for weakly supervised classification

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing

Pub Date : 2025-04-05 DOI: 10.1016/j.imavis.2025.105527

Heng Zhou , Ping Zhong

Weakly Supervised Learning (WSL) aims to improve model robustness and manage label uncertainty, but current methods struggle to handle various weak label sources, such as incomplete and noisy labels. Additionally, these methods struggle with a lack of adaptability from reliance on prior knowledge and the complexity of managing data-label dependencies. To address these problems, we propose a wave-based cross-phase network (WCPN) to enhance adaptability for incomplete and noisy labels. Specifically, we expand wave representations and design a cross-phase token mixing (CPTM) module to refine feature relationships and integrate strategies for various weak labels. The proposed CPFE algorithm in the CPTM optimizes feature relationships by using self-interference and mutual-interference to process phase information between feature tokens, thus enhancing semantic consistency and discriminative ability. Furthermore, by employing a data-driven tri-branch structure and maximizing mutual information between features and labels, WCPN effectively overcomes the inflexibility caused by reliance on prior knowledge and complex data-label dependencies. In this way, WCPN leverages wave representations to enhance feature interactions, capture data complexity and diversity, and improve feature compactness for specific categories. Experimental results demonstrate that WCPN excels across various supervision levels and consistently outperforms existing advanced methods. It effectively handles noisy and incomplete labels, showing remarkable adaptability and enhanced feature understanding.

{"title":"Wave-based cross-phase representation for weakly supervised classification","authors":"Heng Zhou , Ping Zhong","doi":"10.1016/j.imavis.2025.105527","DOIUrl":"10.1016/j.imavis.2025.105527","url":null,"abstract":"<div><div>Weakly Supervised Learning (WSL) aims to improve model robustness and manage label uncertainty, but current methods struggle to handle various weak label sources, such as incomplete and noisy labels. Additionally, these methods struggle with a lack of adaptability from reliance on prior knowledge and the complexity of managing data-label dependencies. To address these problems, we propose a wave-based cross-phase network (WCPN) to enhance adaptability for incomplete and noisy labels. Specifically, we expand wave representations and design a cross-phase token mixing (CPTM) module to refine feature relationships and integrate strategies for various weak labels. The proposed CPFE algorithm in the CPTM optimizes feature relationships by using self-interference and mutual-interference to process phase information between feature tokens, thus enhancing semantic consistency and discriminative ability. Furthermore, by employing a data-driven tri-branch structure and maximizing mutual information between features and labels, WCPN effectively overcomes the inflexibility caused by reliance on prior knowledge and complex data-label dependencies. In this way, WCPN leverages wave representations to enhance feature interactions, capture data complexity and diversity, and improve feature compactness for specific categories. Experimental results demonstrate that WCPN excels across various supervision levels and consistently outperforms existing advanced methods. It effectively handles noisy and incomplete labels, showing remarkable adaptability and enhanced feature understanding.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105527"},"PeriodicalIF":4.2,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143785590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0