Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105391
Enhancing weakly supervised semantic segmentation with efficient and robust neighbor-attentive superpixel aggregation
Chen Wang, Huifang Ma, Di Zhang, Xiaolong Li, Zhixin Li
Image-level Weakly-Supervised Semantic Segmentation (WSSS) has become prominent as a technique that utilizes readily available image-level supervisory information. However, traditional methods that rely on pseudo-segmentation labels derived from Class Activation Maps (CAMs) are limited in terms of segmentation accuracy, primarily due to the incomplete nature of CAMs. Despite recent advancements in improving the comprehensiveness of CAM-derived pseudo-labels, challenges persist in handling ambiguity at object boundaries, and these methods also tend to be computationally intensive. To address these challenges, we propose a novel framework called Neighbor-Attentive Superpixel Aggregation (NASA). Inspired by the effectiveness of superpixel segmentation in homogenizing images through color and texture analysis, NASA enables the transformation from superpixel-wise to pixel-wise pseudo-labels. This approach significantly reduces semantic uncertainty at object boundaries and alleviates the computational overhead associated with direct pixel-wise label generation from CAMs. In addition, we introduce a superpixel augmentation strategy to enhance the model's discrimination capabilities across different superpixels. Empirical studies demonstrate the superiority of NASA over existing WSSS methodologies. On the PASCAL VOC 2012 and MS COCO 2014 datasets, NASA achieves impressive mIoU scores of 73.5% and 46.4%, respectively.
Image and Vision Computing, vol. 154, Article 105391.
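To make the superpixel-to-pixel label transfer concrete, here is a minimal sketch, assuming SLIC superpixels from scikit-image and a per-class CAM array; the function name, segment count, and background threshold are illustrative choices, not NASA's actual implementation.

    # Sketch: average CAM scores inside each SLIC superpixel, assign the winning
    # class to the whole superpixel, then broadcast it back to every pixel.
    import numpy as np
    from skimage.segmentation import slic

    def superpixel_pseudo_labels(image, cam, n_segments=200, bg_threshold=0.25):
        """image: HxWx3 float array in [0, 1]; cam: CxHxW class activation maps."""
        segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
        num_classes, H, W = cam.shape
        labels = np.zeros((H, W), dtype=np.int64)       # 0 is reserved for background
        for sp in np.unique(segments):
            mask = segments == sp
            scores = cam[:, mask].mean(axis=1)          # mean activation per class
            best = scores.argmax()
            if scores[best] > bg_threshold:
                labels[mask] = best + 1                 # +1 keeps 0 as background
        return labels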
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105379
Enhancing few-shot object detection through pseudo-label mining
Pablo Garcia-Fernandez, Daniel Cores, Manuel Mucientes
Few-shot object detection involves adapting an existing detector to a set of unseen categories with few annotated examples. This data limitation causes these methods to underperform those trained on large labeled datasets. In many scenarios, there is a large amount of unlabeled data that is never exploited. Thus, we propose to exPAND the initial novel set by mining pseudo-labels. From a raw set of detections, xPAND obtains reliable pseudo-labels suitable for training any detector. To this end, we propose two new modules: Class Confirmation and Box Confirmation. Class Confirmation aims to remove misclassified pseudo-labels by comparing candidates with expected class prototypes. Box Confirmation estimates IoU to discard inadequately framed objects. Experimental results demonstrate that xPAND enhances the performance of multiple detectors by up to +5.9 nAP and +16.4 nAP50 points on MS-COCO and PASCAL VOC, respectively, establishing a new state of the art. Code: https://github.com/PAGF188/xPAND.
Image and Vision Computing, vol. 154, Article 105379.
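As an illustration of the two confirmation modules, here is a hedged PyTorch sketch; the thresholds, tensor shapes, and the way class prototypes are obtained are assumptions rather than xPAND's actual settings.

    import torch
    import torch.nn.functional as F

    def confirm_pseudo_labels(features, pred_classes, pred_iou, class_prototypes,
                              sim_thresh=0.7, iou_thresh=0.6):
        """features: NxD box embeddings, pred_classes: N predicted labels,
        pred_iou: N estimated IoU values, class_prototypes: CxD prototypes."""
        feats = F.normalize(features, dim=1)
        protos = F.normalize(class_prototypes, dim=1)
        sims = feats @ protos.t()                        # N x C cosine similarities
        # Class Confirmation: the candidate must agree with (and be close enough to)
        # the prototype of its predicted class.
        class_ok = (sims.argmax(dim=1) == pred_classes) & (sims.max(dim=1).values > sim_thresh)
        # Box Confirmation: discard inadequately framed boxes via the estimated IoU.
        box_ok = pred_iou > iou_thresh
        return class_ok & box_ok                         # boolean mask of kept pseudo-labels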
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2025.105426
Skeleton action recognition via group sparsity constrained variant graph auto-encoder
Hongjuan Pei, Jiaying Chen, Shihao Gao, Taisong Jin, Ke Lu
Human skeleton action recognition has garnered significant attention from researchers due to its promising performance in real-world applications. Recently, graph neural networks (GNNs) have been applied to this field, with graph convolutional networks (GCNs) being commonly utilized to modulate the spatial configuration and temporal dynamics of joints. However, the GCN-based paradigm for skeleton action recognition fails to recognize and disentangle the heterogeneous factors of action representation. Consequently, the learned action features are susceptible to irrelevant factors, hindering further performance enhancement. To address this issue and learn a disentangled action representation, we propose a novel skeleton action recognition method, termed β-bVGAE. The proposed method leverages a group sparsity constrained variant graph auto-encoder, rather than graph convolutional networks, to learn the discriminative features of the skeleton sequence. Extensive experiments conducted on benchmark action recognition datasets demonstrate that our proposed method outperforms existing GCN-based skeleton action recognition methods, highlighting the significant potential of the variant auto-encoder architecture in the field of skeleton action recognition.
Image and Vision Computing, vol. 154, Article 105426.
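The abstract does not spell out the objective, so the sketch below only illustrates the general ingredient it names: a variational graph auto-encoder loss with a group-sparsity (L2,1-style) penalty on grouped latent dimensions; the grouping, weights, and reconstruction term are assumptions, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def group_sparse_vgae_loss(adj_recon, adj_target, mu, logvar,
                               num_groups=8, beta=1.0, lambda_group=1e-3):
        """adj_recon: reconstructed adjacency logits, adj_target: {0,1} float matrix,
        mu/logvar: latent Gaussian parameters of shape N x D."""
        recon = F.binary_cross_entropy_with_logits(adj_recon, adj_target)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # Split the latent dimensions into groups and penalise the sum of group
        # L2 norms, which pushes entire groups (factors) toward zero.
        groups = mu.chunk(num_groups, dim=-1)
        group_penalty = sum(g.norm(p=2, dim=-1).mean() for g in groups)
        return recon + beta * kl + lambda_group * group_penalty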
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105346
Feature extraction and fusion algorithm for infrared visible light images based on residual and generative adversarial network
Naigong Yu, YiFan Fu, QiuSheng Xie, QiMing Cheng, Mohammad Mehedi Hasan
With the application and popularization of depth cameras, image fusion techniques based on infrared and visible light are increasingly used in various fields. Object detection and robot navigation impose more stringent requirements on the texture details and quality of fused images. Existing residual networks, attention mechanisms, and generative adversarial networks are ineffective for this image fusion problem because they extract insufficient detail features and do not conform to the human visual perception system when fusing infrared and visible light images. Our newly developed RGFusion network relies on a two-channel attention mechanism, a residual network, and a generative adversarial network, and introduces two new components: a high-precision image feature extractor and an efficient multi-stage training strategy. The input is preprocessed by a high-dimensional mapping, and the features from the complex feature extractor pass through a sophisticated two-stage image fusion process to obtain multi-feature structures, resulting in high-quality fused images rich in detail. Extensive experiments on public datasets validate this fusion approach: RGFusion leads on the EN, SF, and SD metrics, reaching 7.366, 13.322, and 49.281 on the TNO dataset and 7.276, 19.171, and 53.777 on the RoadScene dataset, respectively.
Image and Vision Computing, vol. 154, Article 105346.
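The three reported quantities are standard fusion metrics; the sketch below shows how EN (entropy), SF (spatial frequency), and SD (standard deviation) are conventionally computed on an 8-bit grayscale fused image. This is a generic reference implementation, not code from the paper.

    import numpy as np

    def fusion_metrics(fused):
        """fused: HxW uint8 grayscale fused image; returns (EN, SF, SD)."""
        img = fused.astype(np.float64)
        hist, _ = np.histogram(fused, bins=256, range=(0, 256))
        p = hist / hist.sum()
        p = p[p > 0]
        en = -(p * np.log2(p)).sum()                      # Shannon entropy of intensities
        rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row-wise gradient energy
        cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column-wise gradient energy
        sf = np.sqrt(rf ** 2 + cf ** 2)                   # spatial frequency
        sd = img.std()                                    # standard deviation
        return en, sf, sd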
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105382
EDCAANet: A lightweight COD network based on edge detection and coordinate attention assistance
Qing Pan, Xiayuan Feng, Nili Tian
To achieve higher efficiency and accuracy in camouflaged object detection (COD), this paper presents a lightweight COD network based on edge detection and coordinate attention assistance (EDCAANet). First, an Integrated Edge and Global Context Information Module (IEGC) is proposed, which uses edge detection as an auxiliary cue that collaborates with atrous spatial pyramid pooling (ASPP) to obtain global context information and achieve preliminary localization of the camouflaged object. Then, a Receptive Field Module based on Coordinate Attention (RFMC) is put forward, in which the Coordinate Attention (CA) mechanism serves as a further aid to expand receptive-field features and achieve a global comprehension of the image. In the final feature-fusion stage, the proposed lightweight Adjacent and Global Context Focusing module (AGCF) aggregates the multi-scale semantic features output by the RFMC at adjacent levels and the global context features output by the IEGC. These aggregated features are refined by the proposed Multi Scale Convolutional Aggregation (MSDA) blocks in the module, allowing features to interact and combine at various scales to produce the final predictions. The experiments cover performance comparisons, testing on complex backgrounds, generalization tests, ablation studies, and complexity analysis. Four public datasets are adopted, four recognized COD metrics are employed for evaluation, and three backbone networks and 18 methods are used for comparison. The results show that the proposed method achieves both superior detection performance and higher efficiency.
Image and Vision Computing, vol. 154, Article 105382.
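For readers unfamiliar with the CA mechanism the RFMC builds on, here is a minimal PyTorch sketch of a coordinate attention block (direction-aware pooling along height and width, followed by re-weighting); the reduction ratio and layer choices are illustrative, not EDCAANet's exact design.

    import torch
    import torch.nn as nn

    class CoordinateAttention(nn.Module):
        def __init__(self, channels, reduction=32):
            super().__init__()
            mid = max(8, channels // reduction)
            self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
            self.bn = nn.BatchNorm2d(mid)
            self.act = nn.ReLU(inplace=True)
            self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
            self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

        def forward(self, x):
            n, c, h, w = x.shape
            # Direction-aware pooling: one descriptor per row and per column.
            x_h = x.mean(dim=3, keepdim=True)                       # N x C x H x 1
            x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # N x C x W x 1
            y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
            y_h, y_w = y.split([h, w], dim=2)
            a_h = torch.sigmoid(self.conv_h(y_h))                          # N x C x H x 1
            a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))      # N x C x 1 x W
            return x * a_h * a_w                       # re-weight features per coordinate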
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105356
De-noising mask transformer for referring image segmentation
Yehui Wang, Fang Lei, Baoyan Wang, Qiang Zhang, Xiantong Zhen, Lei Zhang
Referring Image Segmentation (RIS) is a challenging computer vision task that involves identifying and segmenting specific objects in an image based on a natural language description. Unlike conventional segmentation methodologies, RIS needs to bridge the gap between the visual and linguistic modalities to exploit the semantic information provided by natural language. Most existing RIS approaches share a common issue: the intermediate predicted target region also participates in later feature generation and parameter updating. A wrong prediction, which occurs especially in the early training stage, therefore misleads the gradients and ultimately harms training stability. To tackle this issue, we propose the de-noising mask (DNM) transformer, a novel framework for cross-modal integration that replaces cross-attention in the traditional transformer with DNM-attention. Furthermore, two kinds of DNM-attention, named mask-DNM and cluster-DNM, are proposed, in which noisy ground-truth information guides the attention mechanism to produce accurate object queries, i.e., de-noising queries. DNM-attention thus leverages noisy ground-truth information to produce additional de-noising queries, which effectively avoids gradient misleading. Experimental results show that the DNM transformer improves RIS performance and outperforms most existing RIS approaches on three benchmarks.
Image and Vision Computing, vol. 154, Article 105356.
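The abstract gives no implementation details, so the sketch below only illustrates the de-noising-query idea in general terms: noised ground-truth regions are embedded and appended to the learnable object queries during training, and they are supervised to recover the clean target. The noise model, the embedding layer, and all shapes are assumptions.

    import torch
    import torch.nn as nn

    class DenoisingQueryBuilder(nn.Module):
        def __init__(self, hidden_dim=256, noise_scale=0.1):
            super().__init__()
            self.embed = nn.Linear(4, hidden_dim)   # embed a noised target region (box form)
            self.noise_scale = noise_scale

        def forward(self, learnable_queries, gt_boxes):
            """learnable_queries: Q x D, gt_boxes: G x 4 normalized target regions."""
            noise = torch.randn_like(gt_boxes) * self.noise_scale
            dn_queries = self.embed((gt_boxes + noise).clamp(0, 1))   # G x D
            # The de-noising queries are only concatenated at training time; because
            # they are built from (noised) ground truth rather than intermediate
            # predictions, wrong early predictions do not mislead the gradients.
            return torch.cat([learnable_queries, dn_queries], dim=0)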
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105364
CLBSR: A deep curriculum learning-based blind image super resolution network using geometrical prior
Alireza Esmaeilzehi, Amir Mohammad Babaei, Farshid Nooshi, Hossein Zaredar, M. Omair Ahmad
Blind image super resolution (SR) is a challenging computer vision task that involves enhancing the quality of low-resolution (LR) images obtained by various degradation operations. Deep neural networks have provided state-of-the-art performance for blind image SR. It has been shown in the literature that decoupling blind image SR into blurring kernel estimation and high-quality image reconstruction yields superior performance. In this paper, we first propose a novel optimization problem that uses geometrical information as a prior to estimate blurring kernels accurately. We then propose a novel blind image SR network that employs the estimated blurring kernel in its network architecture and learning algorithm to generate high-quality images. In this regard, we utilize a curriculum learning strategy, wherein the training of the SR network is initially facilitated with the ground truth (GT) blurring kernel and then continued with the blurring kernel estimated by our optimization problem. Various experiments show the effectiveness of the proposed blind image SR scheme in comparison to state-of-the-art methods across degradation operations and benchmark datasets.
Image and Vision Computing, vol. 154, Article 105364.
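A minimal sketch of the curriculum schedule described above, assuming a simple epoch-based switch from the GT kernel to the estimated one; the switch epoch and function name are hypothetical.

    def training_kernel(epoch, gt_kernel, estimated_kernel, switch_epoch=20):
        """Return the blurring kernel the SR network is conditioned on at this epoch."""
        # Early epochs learn with the ground-truth kernel to stabilise training;
        # later epochs continue with the kernel estimated by the optimization step.
        return gt_kernel if epoch < switch_epoch else estimated_kernel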
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105393
Class-discriminative domain generalization for semantic segmentation
Muxin Liao, Shishun Tian, Yuhang Zhang, Guoguang Hua, Rong You, Wenbin Zou, Xia Li
Existing domain generalization semantic segmentation methods aim to improve generalization by learning domain-invariant information so that models generalize well to unseen domains. However, these methods ignore the class discriminability of models, which may lead to a class confusion problem. In this paper, a class-discriminative domain generalization (CDDG) approach is proposed to simultaneously alleviate the distribution shift and class confusion for semantic segmentation. Specifically, a dual prototypical contrastive learning module is proposed. Since the high-frequency component is consistent across different domains, class-text-guided high-frequency prototypical contrastive learning is proposed. It uses text embeddings as prior knowledge to guide the learning of a high-frequency prototypical representation from high-frequency components, mining domain-invariant information to further improve the generalization ability. However, domain-specific information may also contain label-related information, i.e., cues that discriminate a specific class. Thus, learning only domain-invariant information may limit the class discriminability of the model. To address this issue, low-frequency prototypical contrastive learning is proposed to learn a class-discriminative representation from low-frequency components, since these are more domain-specific. Finally, the class-discriminative representation and the high-frequency prototypical representation are fused to simultaneously improve the generalization ability and class discriminability of the model. Extensive experiments demonstrate that the proposed approach outperforms current methods on single- and multi-source domain generalization benchmarks.
Image and Vision Computing, vol. 154, Article 105393.
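To ground the two ingredients, here is a hedged sketch of a frequency split via Gaussian blurring and a prototype-based contrastive term; the kernel size, temperature, and prototype handling are assumptions, not the paper's settings.

    import torch.nn.functional as F
    import torchvision.transforms.functional as TF

    def frequency_split(image, kernel_size=21, sigma=5.0):
        """Split a CxHxW image tensor into low- and high-frequency components."""
        low = TF.gaussian_blur(image, kernel_size=[kernel_size, kernel_size],
                               sigma=[sigma, sigma])
        high = image - low          # high-frequency residual, more consistent across domains
        return low, high

    def prototypical_contrastive_loss(features, labels, prototypes, temperature=0.1):
        """features: NxD embeddings, labels: N class indices, prototypes: CxD."""
        logits = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).t()
        return F.cross_entropy(logits / temperature, labels)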
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2025.105428
Efficient and robust multi-camera 3D object detection in bird-eye-view
Yuanlong Wang, Hengtao Jiang, Guanying Chen, Tong Zhang, Jiaqing Zhou, Zezheng Qing, Chunyan Wang, Wanzhong Zhao
Bird's-eye view (BEV) representations are increasingly used in autonomous driving perception because they provide a comprehensive, unobstructed view of the vehicle's surroundings. Compared to transformer- or depth-based methods, ray-transformation-based methods are more suitable for vehicle deployment and more efficient. However, these methods typically depend on accurate extrinsic camera parameters, making them vulnerable to performance degradation when calibration errors or installation changes occur. In this work, we follow ray-transformation-based methods and propose an extrinsic-parameters-free approach, which reduces reliance on accurate offline camera extrinsic calibration by using a neural network to predict the extrinsic parameters online, effectively improving the robustness of the model. In addition, we propose a multi-level and multi-scale image encoder to better encode image features and adopt a more intensive temporal fusion strategy. Our framework contains four main designs: (1) a multi-level and multi-scale image encoder that leverages multi-scale information both across and within layers for better performance; (2) a ray transformation with an extrinsic-parameters-free approach, which transfers image features to BEV space and lessens the impact of extrinsic disturbance on the model's detection performance; (3) an intensive temporal fusion strategy using motion information from five historical frames; and (4) a high-performance BEV encoder that efficiently reduces the spatial dimensions of the voxel-based feature map and fuses the multi-scale and multi-frame BEV features. Experiments on nuScenes show that our best model (R101@900 × 1600) achieves a competitive 41.7% mAP and 53.8% NDS on the validation set, outperforming several state-of-the-art visual BEV models in 3D object detection.
Image and Vision Computing, vol. 154, Article 105428.
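A hedged sketch of the online-extrinsics idea: a small head regresses each camera's rotation (a 6D representation orthonormalized by Gram-Schmidt) and translation from pooled image features instead of reading them from offline calibration. Layer sizes and the 6D parameterization are assumptions, not the paper's design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ExtrinsicsHead(nn.Module):
        def __init__(self, in_channels=256):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(in_channels, 128), nn.ReLU(),
                                     nn.Linear(128, 9))    # 6 for rotation + 3 for translation

        def forward(self, feat):
            """feat: N x C x H x W per-camera feature map."""
            out = self.mlp(feat.mean(dim=(2, 3)))           # global average pooling
            r6, t = out[:, :6], out[:, 6:]
            # Gram-Schmidt turns the 6D vector into a valid rotation matrix.
            a1, a2 = r6[:, :3], r6[:, 3:]
            b1 = F.normalize(a1, dim=1)
            b2 = F.normalize(a2 - (b1 * a2).sum(dim=1, keepdim=True) * b1, dim=1)
            b3 = torch.cross(b1, b2, dim=1)
            R = torch.stack([b1, b2, b3], dim=-1)           # N x 3 x 3
            return R, t                                     # predicted extrinsics per camera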
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2025.105432
Advancing brain tumor segmentation and grading through integration of FusionNet and IBCO-based ALCResNet
Abbas Rehman, Gu Naijie, Asma Aldrees, Muhammad Umer, Abeer Hakeem, Shtwai Alsubai, Lucia Cascone
Brain tumors represent a significant global health challenge, characterized by uncontrolled cerebral cell growth. The variability in size, shape, and anatomical positioning complicates computational classification, which is crucial for effective treatment planning. Accurate detection is essential, as even small diagnostic inaccuracies can significantly increase the mortality risk. Tumor grade stratification is also critical for automated diagnosis; however, current deep learning models often fall short in achieving the desired effectiveness. In this study, we propose an advanced approach that leverages cutting-edge deep learning techniques to improve early detection and tumor severity grading, facilitating automated diagnosis. Clinical bioinformatics datasets are used to source representative brain tumor images, which undergo pre-processing and data augmentation via a Generative Adversarial Network (GAN). The images are then classified using the Adaptive Layer Cascaded ResNet (ALCResNet) model, optimized with the Improved Border Collie Optimization (IBCO) algorithm for enhanced diagnostic accuracy. The integration of FusionNet for precise segmentation and the IBCO-enhanced ALCResNet for optimized feature extraction and classification forms a novel framework. This unique combination ensures not only accurate segmentation but also enhanced precision in grading tumor severity, addressing key limitations of existing methodologies. For segmentation, the FusionNet deep learning model is employed to identify abnormal regions, which are subsequently classified as Meningioma, Glioma, or Pituitary tumors using ALCResNet. Experimental results demonstrate significant improvements in tumor identification and severity grading, with the proposed method achieving superior precision (99.79%) and accuracy (99.33%) compared to existing classifiers and heuristic approaches.
Image and Vision Computing, vol. 154, Article 105432.
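The overall flow can be summarized by the sketch below, in which the segmenter and classifier arguments are placeholder callables standing in for FusionNet and the IBCO-optimized ALCResNet; the GAN-based augmentation applies at training time and is omitted here. This illustrates the pipeline order only, not the paper's code.

    TUMOR_CLASSES = ["Meningioma", "Glioma", "Pituitary"]

    def diagnose(image, segmenter, classifier):
        """image: preprocessed MRI tensor (1 x C x H x W); segmenter and classifier
        are stand-ins for FusionNet and the IBCO-optimized ALCResNet."""
        tumor_mask = segmenter(image)              # abnormal-region segmentation
        if tumor_mask.sum() == 0:
            return "no tumor detected"
        roi = image * tumor_mask                   # keep only the segmented region
        logits = classifier(roi)                   # tumor-type classification / grading
        return TUMOR_CLASSES[int(logits.argmax())]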