
Latest publications in Image and Vision Computing

Generative AI in the context of assistive technologies: Trends, limitations and future directions
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105347
Biying Fu , Abdenour Hadid , Naser Damer
With the tremendous success of Large Language Models (LLMs) such as ChatGPT for text generation and Dall-E for high-quality image generation, generative Artificial Intelligence (AI) models have attracted enormous attention in society. Generative AI has permeated many aspects of society, including the economy, education, legislation, computer science, finance, and even healthcare. This article provides a comprehensive survey of the growing and promising use of generative AI in assistive technologies, benefiting parties ranging from assistive-system developers, medical practitioners, and the care workforce to the people who need care and comfort. Ethical concerns, biases, lack of transparency, insufficient explainability, and limited trustworthiness are major challenges when using generative AI in assistive technologies, particularly in systems that affect people directly. Key future research directions to address these issues include creating standardized rules, establishing commonly accepted evaluation metrics and benchmarks for explainability and reasoning processes, and making further advances in understanding and reducing bias and its potential harms. Beyond charting the current trends of applying generative AI to assistive technologies in four identified key domains (care sectors, medical sectors, helping people in need, and co-working), the survey also discusses current limitations and outlines promising future research directions to foster better integration of generative AI in assistive technologies.
{"title":"Generative AI in the context of assistive technologies: Trends, limitations and future directions","authors":"Biying Fu ,&nbsp;Abdenour Hadid ,&nbsp;Naser Damer","doi":"10.1016/j.imavis.2024.105347","DOIUrl":"10.1016/j.imavis.2024.105347","url":null,"abstract":"<div><div>With the tremendous successes of Large Language Models (LLMs) like ChatGPT for text generation and Dall-E for high-quality image generation, generative Artificial Intelligence (AI) models have shown a hype in our society. Generative AI seamlessly delved into different aspects of society ranging from economy, education, legislation, computer science, finance, and even healthcare. This article provides a comprehensive survey on the increased and promising use of generative AI in assistive technologies benefiting different parties, ranging from the assistive system developers, medical practitioners, care workforce, to the people who need the care and the comfort. Ethical concerns, biases, lack of transparency, insufficient explainability, and limited trustworthiness are major challenges when using generative AI in assistive technologies, particularly in systems that impact people directly. Key future research directions to address these issues include creating standardized rules, establishing commonly accepted evaluation metrics and benchmarks for explainability and reasoning processes, and making further advancements in understanding and reducing bias and its potential harms. Beyond showing the current trends of applying generative AI in the scope of assistive technologies in four identified key domains, which include care sectors, medical sectors, helping people in need, and co-working, the survey also discusses the current limitations and provides promising future research directions to foster better integration of generative AI in assistive technologies.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105347"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dual multi scale networks for medical image segmentation using contrastive learning
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105371
Akshat Dhamale , Ratnavel Rajalakshmi , Ananthakrishnan Balasundaram
DMSNet, a novel model for medical image segmentation, is proposed in this work. DMSNet employs a dual multi-scale architecture, combining the computational efficiency of EfficientNet B5 with the contextual understanding of the Pyramid Vision Transformer (PVT). Integrating a multi-scale module into both encoders enhances the model's capacity to capture intricate details across various resolutions, enabling precise delineation of complex foreground boundaries. Notably, DMSNet incorporates contrastive learning with a novel pixel-wise contrastive loss function during training, contributing to higher segmentation accuracy and improved generalization. The model's performance is demonstrated through experimental evaluation on four diverse datasets: brain tumor segmentation (BraTS 2020), diabetic foot ulcer segmentation (DFU), polyps (KVASIR-SEG), and breast cancer segmentation (BCSS). We employ recently introduced metrics to evaluate and compare our model with other state-of-the-art architectures. By advancing segmentation accuracy through innovative architectural design, multi-scale modules, and contrastive learning techniques, DMSNet represents a significant stride in the field, with potential implications for improved patient care and outcomes.
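The abstract does not specify the exact form of the pixel-wise contrastive loss, so the following is only a minimal sketch of one plausible formulation: a supervised InfoNCE-style loss over subsampled pixel embeddings, where pixels sharing a ground-truth class act as positives. Function and argument names (`pixel_contrastive_loss`, `max_pixels`) are illustrative, not the authors'.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1, max_pixels=512):
    """Hypothetical pixel-wise contrastive (InfoNCE-style) loss.

    embeddings: (N, C, H, W) per-pixel features from the decoder
    labels:     (N, H, W)    integer ground-truth class map
    Pixels with the same label are treated as positives, all others as negatives.
    """
    n, c, h, w = embeddings.shape
    feats = embeddings.permute(0, 2, 3, 1).reshape(-1, c)
    lbls = labels.reshape(-1)

    # Subsample pixels so the pairwise similarity matrix stays small.
    idx = torch.randperm(feats.size(0), device=feats.device)[:max_pixels]
    feats = F.normalize(feats[idx], dim=1)
    lbls = lbls[idx]

    sim = feats @ feats.t() / temperature                      # (P, P) cosine similarities
    self_mask = torch.eye(len(lbls), dtype=torch.bool, device=sim.device)
    pos_mask = (lbls[:, None] == lbls[None, :]) & ~self_mask

    # For each anchor pixel, maximize the log-probability of its positives.
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss[pos_mask.any(dim=1)].mean()                    # ignore anchors with no positive
```

In practice such a term would presumably be added, with a small weight, to the usual segmentation loss (e.g. cross-entropy or Dice).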
{"title":"Dual multi scale networks for medical image segmentation using contrastive learning","authors":"Akshat Dhamale ,&nbsp;Ratnavel Rajalakshmi ,&nbsp;Ananthakrishnan Balasundaram","doi":"10.1016/j.imavis.2024.105371","DOIUrl":"10.1016/j.imavis.2024.105371","url":null,"abstract":"<div><div>DMSNet, a novel model for medical image segmentation is proposed in this research work. DMSNet employs a dual multi-scale architecture, combining the computational efficiency of EfficientNet B5 with the contextual understanding of the Pyramid Vision Transformer (PVT). Integration of a multi-scale module in both encoders enhances the model's capacity to capture intricate details across various resolutions, enabling precise delineation of complex foreground boundaries. Notably, DMSNet incorporates contrastive learning with a novel pixel-wise contrastive loss function during training, contributing to heightened segmentation accuracy and improved generalization capabilities. The model's performance is demonstrated through experimental evaluation on the four diverse datasets including Brain tumor segmentation (BraTS 2020), Diabetic Foot ulcer segmentation (DFU), Polyps (KVASIR-SEG) and Breast cancer segmentation (BCSS). We have employed recently introduced metrics to evaluate and compare our model with other state-of-the-art architectures. By advancing segmentation accuracy through innovative architectural design, multi-scale modules, and contrastive learning techniques, DMSNet represents a significant stride in the field, with potential implications for improved patient care and outcomes.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105371"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing weakly supervised semantic segmentation with efficient and robust neighbor-attentive superpixel aggregation
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105391
Chen Wang , Huifang Ma , Di Zhang , Xiaolong Li , Zhixin Li
Image-level Weakly-Supervised Semantic Segmentation (WSSS) has become prominent as a technique that utilizes readily available image-level supervisory information. However, traditional methods that rely on pseudo-segmentation labels derived from Class Activation Maps (CAMs) are limited in segmentation accuracy, primarily due to the incomplete nature of CAMs. Despite recent advances in improving the comprehensiveness of CAM-derived pseudo-labels, challenges persist in handling ambiguity at object boundaries, and these methods also tend to be computationally intensive. To address these challenges, we propose a novel framework called Neighbor-Attentive Superpixel Aggregation (NASA). Inspired by the effectiveness of superpixel segmentation in homogenizing images through color and texture analysis, NASA transforms superpixel-wise pseudo-labels into pixel-wise ones. This approach significantly reduces semantic uncertainty at object boundaries and alleviates the computational overhead of generating pixel-wise labels directly from CAMs. In addition, we introduce a superpixel augmentation strategy to enhance the model's ability to discriminate between different superpixels. Empirical studies demonstrate the superiority of NASA over existing WSSS methodologies. On the PASCAL VOC 2012 and MS COCO 2014 datasets, NASA achieves impressive mIoU scores of 73.5% and 46.4%, respectively.
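The paper's superpixel aggregation is not detailed in the abstract; the sketch below shows one straightforward way a superpixel-to-pixel conversion of CAM evidence could work, assuming per-class CAM scores and a precomputed superpixel map (e.g. from SLIC) are available. The function name and the background threshold are illustrative.

```python
import numpy as np

def superpixels_to_pixel_labels(cam, superpixels, bg_thresh=0.3):
    """Hypothetical superpixel aggregation of CAM scores into pixel-wise pseudo-labels.

    cam:         (K, H, W) class activation maps for K foreground classes, in [0, 1]
    superpixels: (H, W)    integer superpixel ids
    Returns an (H, W) pseudo-label map, with 0 reserved for background.
    """
    k, h, w = cam.shape
    pseudo = np.zeros((h, w), dtype=np.int64)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        # Average each class's activation over the whole superpixel.
        scores = cam[:, mask].mean(axis=1)          # (K,)
        best = scores.argmax()
        if scores[best] >= bg_thresh:               # weak activation stays background
            pseudo[mask] = best + 1
    return pseudo
```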
{"title":"Enhancing weakly supervised semantic segmentation with efficient and robust neighbor-attentive superpixel aggregation","authors":"Chen Wang ,&nbsp;Huifang Ma ,&nbsp;Di Zhang ,&nbsp;Xiaolong Li ,&nbsp;Zhixin Li","doi":"10.1016/j.imavis.2024.105391","DOIUrl":"10.1016/j.imavis.2024.105391","url":null,"abstract":"<div><div>Image-level Weakly-Supervised Semantic Segmentation (WSSS) has become prominent as a technique that utilizes readily available image-level supervisory information. However, traditional methods that rely on pseudo-segmentation labels derived from Class Activation Maps (CAMs) are limited in terms of segmentation accuracy, primarily due to the incomplete nature of CAMs. Despite recent advancements in improving the comprehensiveness of CAM-derived pseudo-labels, challenges persist in handling ambiguity at object boundaries, and these methods also tend to be computationally intensive. To address these challenges, we propose a novel framework called Neighbor-Attentive Superpixel Aggregation (NASA). Inspired by the effectiveness of superpixel segmentation in homogenizing images through color and texture analysis, NASA enables the transformation from superpixel-wise to pixel-wise pseudo-labels. This approach significantly reduces semantic uncertainty at object boundaries and alleviates the computational overhead associated with direct pixel-wise label generation from CAMs. Besides, we introduce a superpixel augmentation strategy to enhance the model’s discrimination capabilities across different superpixels. Empirical studies demonstrate the superiority of NASA over existing WSSS methodologies. On the PASCAL VOC 2012 and MS COCO 2014 datasets, NASA achieves impressive mIoU scores of 73.5% and 46.4%, respectively.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105391"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing few-shot object detection through pseudo-label mining
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105379
Pablo Garcia-Fernandez, Daniel Cores, Manuel Mucientes
Few-shot object detection involves adapting an existing detector to a set of unseen categories with only a few annotated examples. This data limitation causes these methods to underperform detectors trained on large labeled datasets. In many scenarios, a large amount of unlabeled data is available but never exploited. We therefore propose to exPAND the initial novel set by mining pseudo-labels. From a raw set of detections, xPAND obtains reliable pseudo-labels suitable for training any detector. To this end, we propose two new modules: Class Confirmation and Box Confirmation. Class Confirmation removes misclassified pseudo-labels by comparing candidates with expected class prototypes. Box Confirmation estimates IoU to discard inadequately framed objects. Experimental results demonstrate that xPAND improves the performance of multiple detectors by up to +5.9 nAP and +16.4 nAP50 points on MS-COCO and PASCAL VOC, respectively, establishing a new state of the art. Code: https://github.com/PAGF188/xPAND.
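As a rough illustration of the two confirmation stages, the sketch below filters candidate detections by similarity to class prototypes (Class Confirmation) and by a predicted IoU score (Box Confirmation). The thresholds, the prototype source, and the existence of a separate IoU-estimation head are assumptions; the released code at the link above is authoritative.

```python
import torch
import torch.nn.functional as F

def filter_pseudo_labels(feats, boxes, labels, prototypes, iou_scores,
                         sim_thresh=0.7, iou_thresh=0.5):
    """Hypothetical sketch of the two confirmation stages described in the abstract.

    feats:      (M, C) RoI features of M candidate detections
    boxes:      (M, 4) candidate boxes
    labels:     (M,)   predicted class indices
    prototypes: (K, C) per-class prototype features (assumed available)
    iou_scores: (M,)   IoU predicted by an assumed box-confirmation head
    """
    feats = F.normalize(feats, dim=1)
    prototypes = F.normalize(prototypes, dim=1)

    # Class Confirmation: a candidate must resemble the prototype of its predicted class.
    sim_to_own_class = (feats * prototypes[labels]).sum(dim=1)
    class_ok = sim_to_own_class >= sim_thresh

    # Box Confirmation: a candidate must be predicted to frame the object tightly enough.
    box_ok = iou_scores >= iou_thresh

    keep = class_ok & box_ok
    return boxes[keep], labels[keep]
```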
{"title":"Enhancing few-shot object detection through pseudo-label mining","authors":"Pablo Garcia-Fernandez,&nbsp;Daniel Cores,&nbsp;Manuel Mucientes","doi":"10.1016/j.imavis.2024.105379","DOIUrl":"10.1016/j.imavis.2024.105379","url":null,"abstract":"<div><div>Few-shot object detection involves adapting an existing detector to a set of unseen categories with few annotated examples. This data limitation makes these methods to underperform those trained on large labeled datasets. In many scenarios, there is a high amount of unlabeled data that is never exploited. Thus, we propose to e<strong>xPAND</strong> the initial novel set by mining pseudo-labels. From a raw set of detections, xPAND obtains reliable pseudo-labels suitable for training any detector. To this end, we propose two new modules: Class and Box confirmation. Class Confirmation aims to remove misclassified pseudo-labels by comparing candidates with expected class prototypes. Box Confirmation estimates IoU to discard inadequately framed objects. Experimental results demonstrate that xPAND enhances the performance of multiple detectors up to +5.9 nAP and +16.4 nAP50 points for MS-COCO and PASCAL VOC, respectively, establishing a new state of the art. Code: <span><span>https://github.com/PAGF188/xPAND</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105379"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143139012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Skeleton action recognition via group sparsity constrained variant graph auto-encoder
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2025.105426
Hongjuan Pei , Jiaying Chen , Shihao Gao , Taisong Jin , Ke Lu
Human skeleton action recognition has garnered significant attention from researchers due to its promising performance in real-world applications. Recently, graph neural networks (GNNs) have been applied to this field, with graph convolutional networks (GCNs) commonly used to model the spatial configuration and temporal dynamics of joints. However, the GCN-based paradigm for skeleton action recognition fails to recognize and disentangle the heterogeneous factors of the action representation. Consequently, the learned action features are susceptible to irrelevant factors, hindering further performance gains. To address this issue and learn a disentangled action representation, we propose a novel skeleton action recognition method, termed β-bVGAE. The proposed method leverages a group sparsity constrained variant graph auto-encoder, rather than graph convolutional networks, to learn discriminative features of the skeleton sequence. Extensive experiments on benchmark action recognition datasets demonstrate that our method outperforms existing GCN-based skeleton action recognition methods, highlighting the significant potential of the variant auto-encoder architecture for skeleton action recognition.
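The precise group-sparsity constraint is not given in the abstract; the following is a minimal sketch of a common choice, an L2,1 (group-lasso) penalty over equally sized groups of latent dimensions, which would be added to the auto-encoder objective to encourage whole groups of factors to switch off. All names are illustrative.

```python
import torch

def group_sparsity_penalty(z, num_groups):
    """Hypothetical group-sparsity (L2,1) penalty on a latent code.

    z: (N, D) latent representations; D is split into `num_groups` equal groups.
    Summing the L2 norm of each group pushes entire groups toward zero, so the
    surviving groups can capture disentangled factors of the action.
    """
    n, d = z.shape
    assert d % num_groups == 0, "latent size must be divisible by the number of groups"
    groups = z.view(n, num_groups, d // num_groups)
    return groups.norm(p=2, dim=2).sum(dim=1).mean()
```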
{"title":"Skeleton action recognition via group sparsity constrained variant graph auto-encoder","authors":"Hongjuan Pei ,&nbsp;Jiaying Chen ,&nbsp;Shihao Gao ,&nbsp;Taisong Jin ,&nbsp;Ke Lu","doi":"10.1016/j.imavis.2025.105426","DOIUrl":"10.1016/j.imavis.2025.105426","url":null,"abstract":"<div><div>Human skeleton action recognition has garnered significant attention from researchers due to its promising performance in real-world applications. Recently, graph neural networks (GNNs) have been applied to this field, with graph convolution networks (GCNs) being commonly utilized to modulate the spatial configuration and temporal dynamics of joints. However, the GCN-based paradigm for skeleton action recognition fails to recognize and disentangle the heterogeneous factors of action representation. Consequently, the learned action features are susceptible to irrelevant factors, hindering further performance enhancement. To address this issue and learn a disentangled action representation, we propose a novel skeleton action recognition method, termed <span><math><mi>β</mi></math></span>-bVGAE. The proposed method leverages group sparsity constrained Variant graph auto-encoder, rather than graph convolutional networks, to learn the discriminative features of the skeleton sequence. Extensive experiments conducted on benchmark action recognition datasets demonstrate that our proposed method outperforms existing GCN-based skeleton action recognition methods, highlighting the significant potential of the variant auto-encoder architecture in the field of skeleton action recognition.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105426"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143139143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Feature extraction and fusion algorithm for infrared visible light images based on residual and generative adversarial network
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105346
Naigong Yu, YiFan Fu, QiuSheng Xie, QiMing Cheng, Mohammad Mehedi Hasan
With the application and popularization of depth cameras, image fusion techniques based on infrared and visible light are increasingly used in many fields. Object detection and robot navigation impose stringent requirements on the texture detail and image quality of fused images. Existing residual networks, attention mechanisms, and generative adversarial networks handle the image fusion problem poorly because they extract insufficient detail features and do not conform to the human visual perception system when fusing infrared and visible light images. Our newly developed RGFusion network relies on a two-channel attention mechanism, a residual network, and a generative adversarial network, and introduces two new components: a high-precision image feature extractor and an efficient multi-stage training strategy. The network preprocesses inputs with a high-dimensional mapping, and the feature extractor operates within a two-stage image fusion process to obtain feature structures with multiple characteristics, resulting in high-quality fused images rich in detail. Extensive experiments on public datasets validate this fusion approach: RGFusion leads on the EN, SF, and SD metrics, reaching 7.366, 13.322, and 49.281 on the TNO dataset and 7.276, 19.171, and 53.777 on the RoadScene dataset, respectively.
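EN, SF, and SD are standard no-reference fusion-quality metrics (information entropy, spatial frequency, and standard deviation). The sketch below shows how they are commonly computed for a grayscale fused image; the exact evaluation code used by the authors may differ.

```python
import numpy as np

def fusion_metrics(img):
    """Common fusion-quality metrics for a fused image.

    img: (H, W) grayscale image with values in [0, 255].
    Returns (EN, SF, SD).
    """
    x = img.astype(np.float64)

    # EN: Shannon entropy of the grey-level histogram (higher = more information).
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    en = -(p * np.log2(p)).sum()

    # SF: spatial frequency, combining row- and column-gradient energy.
    rf = np.sqrt(np.mean(np.diff(x, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(x, axis=0) ** 2))
    sf = np.sqrt(rf ** 2 + cf ** 2)

    # SD: standard deviation of intensities (higher = more contrast).
    sd = x.std()
    return en, sf, sd
```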
{"title":"Feature extraction and fusion algorithm for infrared visible light images based on residual and generative adversarial network","authors":"Naigong Yu,&nbsp;YiFan Fu,&nbsp;QiuSheng Xie,&nbsp;QiMing Cheng,&nbsp;Mohammad Mehedi Hasan","doi":"10.1016/j.imavis.2024.105346","DOIUrl":"10.1016/j.imavis.2024.105346","url":null,"abstract":"<div><div>With the application and popularization of depth cameras, image fusion techniques based on infrared and visible light are increasingly used in various fields. Object detection and robot navigation impose more stringent requirements on the texture details and image quality of fused images. Existing residual network, attention mechanisms, and generative adversarial network are ineffective in dealing with the image fusion problem because of insufficient detail feature extraction and non-conformity to the human visual perception system during the fusion of infrared and visible light images. Our newly developed RGFusion network relies on a two-channel attentional mechanism, a residual network, and a generative adversarial network that introduces two new components: a high-precision image feature extractor and an efficient multi-stage training strategy. The network is preprocessed by a high-dimensional mapping and the complex feature extractor is processed through a sophisticated two-stage image fusion process to obtain feature structures with multiple features, resulting in high-quality fused images rich in detailed features. Extensive experiments on public datasets validate this fusion approach, and RGFusion is at the forefront of SD metrics for EN and SF, reaching 7.366, 13.322, and 49.281 on the TNO dataset and 7.276, 19.171, and 53.777 on the RoadScene dataset, respectively.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105346"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
EDCAANet: A lightweight COD network based on edge detection and coordinate attention assistance
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105382
Qing Pan, Xiayuan Feng, Nili Tian
To achieve both higher efficiency and higher accuracy in camouflaged object detection (COD), this paper presents a lightweight COD network based on edge detection and coordinate attention assistance (EDCAANet). First, an Integrated Edge and Global Context Information module (IEGC) is proposed, which uses edge detection as an auxiliary cue that collaborates with atrous spatial pyramid pooling (ASPP) to obtain global context information and achieve preliminary localization of the camouflaged object. Then, a Receptive Field Module based on Coordinate Attention (RFMC) is put forward, in which the Coordinate Attention (CA) mechanism serves as a further aid to expand receptive-field features and attain a global comprehension of the image. In the final feature fusion stage, the proposed lightweight Adjacent and Global Context Focusing module (AGCF) aggregates the multi-scale semantic features output by the RFMC at adjacent levels and the global context features output by the IEGC. These aggregated features are refined mainly by the proposed Multi-Scale Convolutional Aggregation (MSDA) blocks in the module, allowing features to interact and combine at various scales to ultimately produce the prediction results. The experiments include a performance comparison, testing in complex backgrounds, a generalization experiment, an ablation study, and a complexity analysis. Four public datasets are adopted, four recognized COD metrics are employed for performance evaluation, and 3 backbone networks and 18 methods are used for comparison. The experimental results show that the proposed method achieves both more accurate detection and higher efficiency.
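The RFMC builds on the published Coordinate Attention mechanism (Hou et al., 2021). The sketch below shows a standard CA block in isolation, not the paper's full RFMC wiring; the reduction ratio is an assumed default.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of a standard Coordinate Attention block."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Encode spatial position by pooling along one axis at a time.
        x_h = x.mean(dim=3, keepdim=True)                       # (N, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (N, C, W, 1)

        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (N, C, 1, W)
        return x * a_h * a_w      # reweight features with direction-aware attention
```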
{"title":"EDCAANet: A lightweight COD network based on edge detection and coordinate attention assistance","authors":"Qing Pan,&nbsp;Xiayuan Feng,&nbsp;Nili Tian","doi":"10.1016/j.imavis.2024.105382","DOIUrl":"10.1016/j.imavis.2024.105382","url":null,"abstract":"<div><div>In order to obtain the higher efficiency and the more accuracy in camouflaged object detection (COD), a lightweight COD network based on edge detection and coordinate attention assistance (EDCAANet) is presented in this paper. Firstly, an Integrated Edge and Global Context Information Module (IEGC) is proposed, which uses edge detection as an auxiliary means to collaborate with the atrous spatial convolution pooling pyramid (ASPP) for obtaining global context information to achieve the preliminary positioning of the camouflaged object. Then, the Receptive Field Module based on Coordinate Attention (RFMC) is put forward, in which the Coordinate Attention (CA) mechanism is employed as another aid means to expand receptive ffeld features and then achieve global comprehensive of the image. In the final stage of feature fusion, the proposed lightweight Adjacent and Global Context Focusing module (AGCF) is employed to aggregate the multi-scale semantic features output by RFMC at adjacent levels and the global context features output by IEGC. These aggregated features are mainly refined by the proposed Multi Scale Convolutional Aggregation (MSDA) blocks in the module, allowing features to interact and combine at various scales to ultimately produce prediction results. The experiments include performance comparison experiment, testing in complex background, generalization experiment, as well as ablation experiment and complexity analysis. Four public datasets are adopted for experiments, four recognized COD metrics are employed for performance evaluation, 3 backbone networks and 18 methods are used for comparison. The experimental results show that the proposed method can obtain both the more excellent detection performance and the higher efficiency.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105382"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
De-noising mask transformer for referring image segmentation
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105356
Yehui Wang , Fang Lei , Baoyan Wang , Qiang Zhang , Xiantong Zhen , Lei Zhang
Referring Image Segmentation (RIS) is a challenging computer vision task that involves identifying and segmenting specific objects in an image based on a natural language description. Unlike conventional segmentation methodologies, RIS needs to bridge the gap between the visual and linguistic modalities to exploit the semantic information provided by natural language. Most existing RIS approaches share a common issue: the intermediate predicted target region also participates in later feature generation and parameter updating. A wrong prediction, which occurs especially in the early training stage, therefore misleads the gradients and ultimately harms training stability. To tackle this issue, we propose the de-noising mask (DNM) transformer for cross-modal fusion, a novel framework that replaces cross-attention with DNM-attention in the traditional transformer. Furthermore, two kinds of DNM-attention, named mask-DNM and cluster-DNM, are proposed, in which noisy ground-truth information guides the attention mechanism to produce accurate object queries, i.e., de-noising queries. Thus, DNM-attention leverages noisy ground-truth information to guide the attention mechanism to produce additional de-noising queries, which effectively avoids gradient misleading. Experimental results show that the DNM transformer improves the performance of RIS and outperforms most existing RIS approaches on three benchmarks.
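The abstract does not describe how the de-noising queries are constructed. Purely as an illustration of the idea, the sketch below builds one extra query by pooling image features inside a noise-corrupted copy of the ground-truth mask; the corruption scheme and all names are hypothetical.

```python
import torch

def make_denoising_query(features, gt_mask, flip_ratio=0.2):
    """Hypothetical construction of a de-noising query from noisy ground truth.

    features: (C, H, W) image feature map
    gt_mask:  (H, W)    binary ground-truth mask for the referred object
    """
    noisy = gt_mask.clone().bool()
    flip = torch.rand_like(gt_mask, dtype=torch.float) < flip_ratio
    noisy = noisy ^ flip                          # randomly flip a fraction of pixels
    if noisy.sum() == 0:                          # guard against an empty noisy region
        noisy = gt_mask.bool()
    # Average features over the noisy region to form one extra object query.
    query = features[:, noisy].mean(dim=1)        # (C,)
    return query
```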
{"title":"De-noising mask transformer for referring image segmentation","authors":"Yehui Wang ,&nbsp;Fang Lei ,&nbsp;Baoyan Wang ,&nbsp;Qiang Zhang ,&nbsp;Xiantong Zhen ,&nbsp;Lei Zhang","doi":"10.1016/j.imavis.2024.105356","DOIUrl":"10.1016/j.imavis.2024.105356","url":null,"abstract":"<div><div>Referring Image Segmentation (RIS) is a challenging computer vision task that involves identifying and segmenting specific objects in an image based on a natural language description. Unlike conventional segmentation methodologies, RIS needs to bridge the gap between visual and linguistic modalities to exert the semantic information provided by natural language. Most existing RIS approaches are confronted with the common issue that the intermediate predicted target region also participates in the later feature generation and parameter updating. Then the wrong prediction, especially occurs in the early training stage, will bring the gradient misleading and ultimately affect the training stability. To tackle this issue, we propose de-noising mask (DNM) transformer to fuse the cross-modal integration, a novel framework to replace the cross-attention by DNM-attention in traditional transformer. Furthermore, two kinds of DNM-attention, named mask-DNM and cluster-DNM, are proposed, where noisy ground truth information is adopted to guide the attention mechanism to produce accurate object queries, <em>i.e.</em>, de-nosing query. Thus, DNM-attention leverages noisy ground truth information to guide the attention mechanism to produce additional de-nosing queries, which effectively avoids the gradient misleading. Experimental results show that the DNM transformer improves the performance of RIS and outperforms most existing RIS approaches on three benchmarks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105356"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CLBSR: A deep curriculum learning-based blind image super resolution network using geometrical prior
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105364
Alireza Esmaeilzehi , Amir Mohammad Babaei , Farshid Nooshi , Hossein Zaredar , M. Omair Ahmad
Blind image super resolution (SR) is a challenging computer vision task that involves enhancing the quality of low-resolution (LR) images produced by various degradation operations. Deep neural networks have provided state-of-the-art performance for blind image SR. It has been shown in the literature that decoupling blind image SR into blurring-kernel estimation and high-quality image reconstruction yields superior performance. In this paper, we first propose a novel optimization problem that, by using geometrical information as a prior, is able to estimate blurring kernels accurately. We then propose a novel blind image SR network that employs the estimated blurring kernel in its architecture and learning algorithm to generate high-quality images. In this regard, we adopt a curriculum learning strategy, in which training of the SR network is initially facilitated by the ground truth (GT) blurring kernel and then continued with the estimated kernel obtained from our optimization problem. The results of various experiments show the effectiveness of the proposed blind image SR scheme compared with state-of-the-art methods across various degradation operations and benchmark datasets.
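The curriculum itself is simple to express: early epochs condition the SR network on the ground-truth blur kernel, later epochs on the kernel estimated from the LR input. The training-loop sketch below uses illustrative names (`sr_net`, `kernel_estimator`, `switch_epoch`) and is not the authors' code.

```python
import torch

def train_curriculum(sr_net, kernel_estimator, loader, optimizer, loss_fn,
                     epochs=100, switch_epoch=30):
    """Sketch of a curriculum schedule for kernel-conditioned blind SR training."""
    for epoch in range(epochs):
        for lr_img, hr_img, gt_kernel in loader:
            if epoch < switch_epoch:
                kernel = gt_kernel                        # easy stage: true kernel
            else:
                with torch.no_grad():
                    kernel = kernel_estimator(lr_img)     # hard stage: estimated kernel
            sr_img = sr_net(lr_img, kernel)
            loss = loss_fn(sr_img, hr_img)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```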
{"title":"CLBSR: A deep curriculum learning-based blind image super resolution network using geometrical prior","authors":"Alireza Esmaeilzehi ,&nbsp;Amir Mohammad Babaei ,&nbsp;Farshid Nooshi ,&nbsp;Hossein Zaredar ,&nbsp;M. Omair Ahmad","doi":"10.1016/j.imavis.2024.105364","DOIUrl":"10.1016/j.imavis.2024.105364","url":null,"abstract":"<div><div>Blind image super resolution (SR) is a challenging computer vision task, which involves enhancing the quality of the low-resolution (LR) images obtained by various degradation operations. Deep neural networks have provided state-of-the-art performances for the task of image SR in a blind fashion. It has been shown in the literature that by decoupling the task of blind image SR into the blurring kernel estimation and high-quality image reconstruction, superior performance can be obtained. In this paper, we first propose a novel optimization problem that, by using the geometrical information as prior, is able to estimate the blurring kernels in an accurate manner. We then propose a novel blind image SR network that employs the blurring kernel thus estimated in its network architecture and learning algorithm in order to generate high-quality images. In this regard, we utilize the curriculum learning strategy, wherein the training process of the SR network is initially facilitated by using the ground truth (GT) blurring kernel and then continued with the estimated blurring kernel obtained from our optimization problem. The results of various experiments show the effectiveness of the proposed blind image SR scheme in comparison to state-of-the-art methods on various degradation operations and benchmark datasets.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105364"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Class-discriminative domain generalization for semantic segmentation
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-02-01 DOI: 10.1016/j.imavis.2024.105393
Muxin Liao , Shishun Tian , Yuhang Zhang , Guoguang Hua , Rong You , Wenbin Zou , Xia Li
Existing domain generalization methods for semantic segmentation aim to improve generalization by learning domain-invariant information so that models generalize well to unseen domains. However, these methods ignore the class discriminability of the model, which may lead to class confusion. In this paper, a class-discriminative domain generalization (CDDG) approach is proposed to simultaneously alleviate distribution shift and class confusion for semantic segmentation. Specifically, a dual prototypical contrastive learning module is proposed. Since the high-frequency component is consistent across domains, a class-text-guided high-frequency prototypical contrastive learning is proposed. It uses text embeddings as prior knowledge to guide the learning of high-frequency prototypical representations from high-frequency components, mining domain-invariant information and further improving generalization. However, domain-specific information may also contain label-related information, i.e., cues that discriminate a specific class, so learning only domain-invariant information may limit the class discriminability of the model. To address this issue, a low-frequency prototypical contrastive learning is proposed to learn class-discriminative representations from low-frequency components, since these are more domain-specific. Finally, the class-discriminative representation and the high-frequency prototypical representation are fused to simultaneously improve the generalization ability and class discriminability of the model. Extensive experiments demonstrate that the proposed approach outperforms current methods on single- and multi-source domain generalization benchmarks.
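The abstract does not state how the high- and low-frequency components are obtained. One common and simple choice, sketched below, is to take a blurred copy as the low-frequency component and the residual as the high-frequency component; the box-blur kernel size is an arbitrary illustrative value.

```python
import torch
import torch.nn.functional as F

def split_frequency(img, kernel_size=9):
    """Hypothetical frequency decomposition feeding the two contrastive branches.

    img: (N, C, H, W) image batch.
    Returns (low_frequency, high_frequency).
    """
    n, c, h, w = img.shape
    # Depthwise box-blur kernel of ones, normalized to average the neighborhood.
    kernel = torch.ones(c, 1, kernel_size, kernel_size, device=img.device) / kernel_size ** 2
    low = F.conv2d(img, kernel, padding=kernel_size // 2, groups=c)   # smoothed = low frequency
    high = img - low                                                  # residual = high frequency
    return low, high
```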
{"title":"Class-discriminative domain generalization for semantic segmentation","authors":"Muxin Liao ,&nbsp;Shishun Tian ,&nbsp;Yuhang Zhang ,&nbsp;Guoguang Hua ,&nbsp;Rong You ,&nbsp;Wenbin Zou ,&nbsp;Xia Li","doi":"10.1016/j.imavis.2024.105393","DOIUrl":"10.1016/j.imavis.2024.105393","url":null,"abstract":"<div><div>Existing domain generalization semantic segmentation methods aim to improve the generalization ability by learning domain-invariant information for generalizing well on unseen domains. However, these methods ignore the class discriminability of models, which may lead to a class confusion problem. In this paper, a class-discriminative domain generalization (CDDG) approach is proposed to simultaneously alleviate the distribution shift and class confusion for semantic segmentation. Specifically, a dual prototypical contrastive learning module is proposed. Since the high-frequency component is consistent across different domains, a class-text-guided high-frequency prototypical contrastive learning is proposed. It uses text embeddings as prior knowledge for guiding the learning of high-frequency prototypical representation from high-frequency components to mine domain-invariant information and further improve the generalization ability. However, the domain-specific information may also contain label-related information which refers to the discrimination of a specific class. Thus, only learning the domain-invariant information may limit the class discriminability of models. To address this issue, a low-frequency prototypical contrastive learning is proposed to learn the class-discriminative representation from low-frequency components since it is more domain-specific across different domains. Finally, the class-discriminative representation and high-frequency prototypical representation are fused to simultaneously improve the generalization ability and class discriminability of the model. Extensive experiments demonstrate that the proposed approach outperforms current methods on single- and multi-source domain generalization benchmarks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105393"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0