Computer Vision and Image Understanding最新文献

英文中文

Pruning networks at once via nuclear norm-based regularization and bi-level optimization

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104247

Donghyeon Lee , Eunho Lee , Jaehyuk Kang, Youngbae Hwang

Most network pruning methods focus on identifying redundant channels from pre-trained models, which is inefficient due to its three-step process: pre-training, pruning and fine-tuning, and reconfiguration. In this paper, we propose a pruning-from-scratch framework that unifies these processes into a single approach. We introduce nuclear norm-based regularization to maintain the representational capacity of large networks during pruning. Combining this with MACs-based regularization enhances the performance of the pruned network at the target compression rate. Our bi-level optimization approach simultaneously improves pruning efficiency and representation capacity. Experimental results show that our method achieves 75.4% accuracy on ImageNet without a pre-trained network, using only 41% of the original model’s computational cost. It also attains 0.5% higher performance in compressing the SSD network for object detection. Furthermore, we analyze the effects of nuclear norm-based regularization.

引用次数: 0

Adversarial intensity awareness for robust object detection

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104252

Jikang Cheng, Baojin Huang, Yan Fang, Zhen Han, Zhongyuan Wang

Like other computer vision models, object detectors are vulnerable to adversarial examples (AEs) containing imperceptible perturbations. These AEs can be generated with multiple intensities and then used to attack object detectors in real-world scenarios. One of the most effective ways to improve the robustness of object detectors is adversarial training (AT), which incorporates AEs into the training process. However, while previous AT-based models have shown certain robustness against adversarial attacks of a pre-specific intensity, they still struggle to maintain robustness when defending against adversarial attacks with multiple intensities. To address this issue, we propose a novel robust object detection method based on adversarial intensity awareness. We first explore potential schema to define the relationship between the neglected intensity information and actual evaluation metrics in AT. Then, we propose the sequential intensity loss (SI Loss) to represent and leverage the neglected intensity information in the AEs. Specifically, SI Loss deploys a sequential adaptive strategy to transform intensity into concrete learnable metrics in a discrete and cumulative manner. Additionally, a boundary smoothing algorithm is introduced to mitigate the influence of some particular AEs that challenging to be divided into a certain intensity level. Extensive experiments on PASCAL VOC and MS-COCO datasets substantially demonstrate the superior performance of our method over other defense methods against multi-intensity adversarial attacks.

{"title":"Adversarial intensity awareness for robust object detection","authors":"Jikang Cheng, Baojin Huang, Yan Fang, Zhen Han, Zhongyuan Wang","doi":"10.1016/j.cviu.2024.104252","DOIUrl":"10.1016/j.cviu.2024.104252","url":null,"abstract":"<div><div>Like other computer vision models, object detectors are vulnerable to adversarial examples (AEs) containing imperceptible perturbations. These AEs can be generated with multiple intensities and then used to attack object detectors in real-world scenarios. One of the most effective ways to improve the robustness of object detectors is adversarial training (AT), which incorporates AEs into the training process. However, while previous AT-based models have shown certain robustness against adversarial attacks of a pre-specific intensity, they still struggle to maintain robustness when defending against adversarial attacks with multiple intensities. To address this issue, we propose a novel robust object detection method based on adversarial intensity awareness. We first explore potential schema to define the relationship between the neglected intensity information and actual evaluation metrics in AT. Then, we propose the sequential intensity loss (SI Loss) to represent and leverage the neglected intensity information in the AEs. Specifically, SI Loss deploys a sequential adaptive strategy to transform intensity into concrete learnable metrics in a discrete and cumulative manner. Additionally, a boundary smoothing algorithm is introduced to mitigate the influence of some particular AEs that challenging to be divided into a certain intensity level. Extensive experiments on PASCAL VOC and MS-COCO datasets substantially demonstrate the superior performance of our method over other defense methods against multi-intensity adversarial attacks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104252"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Joint Generating Terminal Correction Imaging method for modular LED integral imaging systems

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2025.104279

Tianshu Li, Shigang Wang

Integral imaging has garnered significant attention in 3D display technology due to its potential for high-quality visualization. However, elemental images in integral imaging systems usually suffer from misalignment due to the mechanical or human-induced assembly within the lens arrays, leading to undesirable display quality. This paper introduces a novel Joint-Generating Terminal Correction Imaging (JGTCI) approach tailored for large-scale, modular LED integral imaging systems to address the misalignment between the optical centers of physical lens arrays and the camera in generated elemental image arrays. Specifically, we propose: (1) a high-sensitivity calibration marker to enhance alignment precision by accurately matching lens centers to the central points of elemental images; (2) a partitioned calibration strategy that supports independent calibration of display sections, enabling seamless system expansion without recalibrating previously adjusted regions; and (3) a calibration setup where markers are strategically placed near the lens focal length, ensuring optimal pixel coverage in the camera frame for improved accuracy. Extensive experimental results demonstrate that our JGTCI approach significantly enhances 3D display accuracy, extends the viewing angle, and improves the scalability and practicality of modular integral imaging systems, outperforming recent state-of-the-art methods.

{"title":"Joint Generating Terminal Correction Imaging method for modular LED integral imaging systems","authors":"Tianshu Li, Shigang Wang","doi":"10.1016/j.cviu.2025.104279","DOIUrl":"10.1016/j.cviu.2025.104279","url":null,"abstract":"<div><div>Integral imaging has garnered significant attention in 3D display technology due to its potential for high-quality visualization. However, elemental images in integral imaging systems usually suffer from misalignment due to the mechanical or human-induced assembly within the lens arrays, leading to undesirable display quality. This paper introduces a novel Joint-Generating Terminal Correction Imaging (JGTCI) approach tailored for large-scale, modular LED integral imaging systems to address the misalignment between the optical centers of physical lens arrays and the camera in generated elemental image arrays. Specifically, we propose: (1) a high-sensitivity calibration marker to enhance alignment precision by accurately matching lens centers to the central points of elemental images; (2) a partitioned calibration strategy that supports independent calibration of display sections, enabling seamless system expansion without recalibrating previously adjusted regions; and (3) a calibration setup where markers are strategically placed near the lens focal length, ensuring optimal pixel coverage in the camera frame for improved accuracy. Extensive experimental results demonstrate that our JGTCI approach significantly enhances 3D display accuracy, extends the viewing angle, and improves the scalability and practicality of modular integral imaging systems, outperforming recent state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104279"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Semantic scene understanding through advanced object context analysis in image

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2025.104299

Luis Hernando Ríos González , Sebastián López Flórez , Alfonso González-Briones , Fernando de la Prieta

Advancements in computer vision have primarily concentrated on interpreting visual data, often overlooking the significance of contextual differences across various regions within images. In contrast, our research introduces a model for indoor scene recognition that pivots towards the ‘attention’ paradigm. This model views attention as a response to the stimulus image properties, suggesting that focus is ‘pulled’ towards the most visually salient zones within an image, as represented in a saliency map. Attention is directed towards these zones based on uninterpreted semantic features of the image, such as luminance contrast, color, shape, and edge orientation. This neurobiologically plausible and computationally tractable approach offers a more nuanced understanding of scenes by prioritizing zones solely based on their image properties. The proposed model enhances scene understanding through an in-depth analysis of the object context in images. Scene recognition is achieved by extracting features from selected regions of interest within individual image frames using patch-based object detection techniques, thus generating distinctive feature descriptors for the identified objects of interest. The resulting feature descriptors are then subjected to semantic embedding, which uses distributed representations to transform the sparse feature vectors into dense semantic vectors within a learned latent space. This enables subsequent classification tasks by machine learning models trained on embedded semantic representations. This model was evaluated on three image datasets: UIUC Sports-8, PASCAL VOC - Visual Object Classes, and a proprietary image set created by the authors. Compared to state-of-the-art methods, this paper presents a more robust approach to the abstraction and generalization of interior scenes. This approach has demonstrated superior accuracy with our novel model over existing models. Consequently, this has led to an improvement in the classification of scenes in the selected indoor environments. Our code is published here: https://github.com/sebastianlop8/Semantic-Scene-Object-Context-Analysis.git

{"title":"Semantic scene understanding through advanced object context analysis in image","authors":"Luis Hernando Ríos González , Sebastián López Flórez , Alfonso González-Briones , Fernando de la Prieta","doi":"10.1016/j.cviu.2025.104299","DOIUrl":"10.1016/j.cviu.2025.104299","url":null,"abstract":"<div><div>Advancements in computer vision have primarily concentrated on interpreting visual data, often overlooking the significance of contextual differences across various regions within images. In contrast, our research introduces a model for indoor scene recognition that pivots towards the ‘attention’ paradigm. This model views attention as a response to the stimulus image properties, suggesting that focus is ‘pulled’ towards the most visually salient zones within an image, as represented in a saliency map. Attention is directed towards these zones based on uninterpreted semantic features of the image, such as luminance contrast, color, shape, and edge orientation. This neurobiologically plausible and computationally tractable approach offers a more nuanced understanding of scenes by prioritizing zones solely based on their image properties. The proposed model enhances scene understanding through an in-depth analysis of the object context in images. Scene recognition is achieved by extracting features from selected regions of interest within individual image frames using patch-based object detection techniques, thus generating distinctive feature descriptors for the identified objects of interest. The resulting feature descriptors are then subjected to semantic embedding, which uses distributed representations to transform the sparse feature vectors into dense semantic vectors within a learned latent space. This enables subsequent classification tasks by machine learning models trained on embedded semantic representations. This model was evaluated on three image datasets: UIUC Sports-8, PASCAL VOC - Visual Object Classes, and a proprietary image set created by the authors. Compared to state-of-the-art methods, this paper presents a more robust approach to the abstraction and generalization of interior scenes. This approach has demonstrated superior accuracy with our novel model over existing models. Consequently, this has led to an improvement in the classification of scenes in the selected indoor environments. Our code is published here: <span><span>https://github.com/sebastianlop8/Semantic-Scene-Object-Context-Analysis.git</span><svg><path></path></svg></span></div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104299"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Corrigendum to “LightSOD: Towards lightweight and efficient network for salient object detection” [J. Comput. Vis. Imag. Underst. 249 (2024) 104148]

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104277

Thien-Thu Ngo , Hoang Ngoc Tran , Md. Delowar Hossain , Eui-Nam Huh

引用次数: 0

Self-supervised vision transformers for semantic segmentation

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104272

Xianfan Gu , Yingdong Hu , Chuan Wen , Yang Gao

Semantic segmentation is a fundamental task in computer vision and it is a building block of many other vision applications. Nevertheless, semantic segmentation annotations are extremely expensive to collect, so using pre-training to alleviate the need for a large number of labeled samples is appealing. Recently, self-supervised learning (SSL) has shown effectiveness in extracting strong representations and has been widely applied to a variety of downstream tasks. However, most works perform sub-optimally in semantic segmentation because they ignore the specific properties of segmentation: (i) the need of pixel level fine-grained understanding; (ii) with the assistance of global context understanding; (iii) both of the above achieve with the dense self-supervisory signal. Based on these key factors, we introduce a systematic self-supervised pre-training framework for semantic segmentation, which consists of a hierarchical encoder–decoder architecture MEVT for generating high-resolution features with global contextual information propagation and a self-supervised training strategy for learning fine-grained semantic features. In our study, our framework shows competitive performance compared with other main self-supervised pre-training methods for semantic segmentation on COCO-Stuff, ADE20K, PASCAL VOC, and Cityscapes datasets. e.g., MEVT achieves the advantage in linear probing by +1.3 mIoU on PASCAL VOC.

{"title":"Self-supervised vision transformers for semantic segmentation","authors":"Xianfan Gu , Yingdong Hu , Chuan Wen , Yang Gao","doi":"10.1016/j.cviu.2024.104272","DOIUrl":"10.1016/j.cviu.2024.104272","url":null,"abstract":"<div><div>Semantic segmentation is a fundamental task in computer vision and it is a building block of many other vision applications. Nevertheless, semantic segmentation annotations are extremely expensive to collect, so using pre-training to alleviate the need for a large number of labeled samples is appealing. Recently, self-supervised learning (SSL) has shown effectiveness in extracting strong representations and has been widely applied to a variety of downstream tasks. However, most works perform sub-optimally in semantic segmentation because they ignore the specific properties of segmentation: (i) the need of pixel level fine-grained understanding; (ii) with the assistance of global context understanding; (iii) both of the above achieve with the dense self-supervisory signal. Based on these key factors, we introduce a systematic self-supervised pre-training framework for semantic segmentation, which consists of a hierarchical encoder–decoder architecture MEVT for generating high-resolution features with global contextual information propagation and a self-supervised training strategy for learning fine-grained semantic features. In our study, our framework shows competitive performance compared with other main self-supervised pre-training methods for semantic segmentation on COCO-Stuff, ADE20K, PASCAL VOC, and Cityscapes datasets. e.g., MEVT achieves the advantage in linear probing by +1.3 mIoU on PASCAL VOC.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104272"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

RS3Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104254

Ankit Jha , Mainak Singha , Avigyan Bhattacharya , Biplab Banerjee

Tackling domain and class generalization challenges remains a significant hurdle in the realm of remote sensing (RS). Recently, large-scale pre-trained vision-language models (VLMs), exemplified by CLIP, have showcased impressive zero-shot and few-shot generalization capabilities through extensive contrastive training. Existing literature emphasizes prompt learning as a means of enriching prompts with both domain and content information, particularly through smaller learnable projectors, thereby addressing multi-domain data challenges perceptibly. Along with this, it is observed that CLIP’s vision encoder fails to generalize well on the puzzled or corrupted RS images. In response, we propose a novel solution utilizing self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS

^{3}

Lip, trained with small projectors featuring few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of SSL tasks and textual features. Empirical findings demonstrate that RS

^{3}

Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, achieving improvements of at least by 1.3% in domain and class generalization tasks.

{"title":"RS3Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP","authors":"Ankit Jha , Mainak Singha , Avigyan Bhattacharya , Biplab Banerjee","doi":"10.1016/j.cviu.2024.104254","DOIUrl":"10.1016/j.cviu.2024.104254","url":null,"abstract":"<div><div>Tackling domain and class generalization challenges remains a significant hurdle in the realm of remote sensing (RS). Recently, large-scale pre-trained vision-language models (VLMs), exemplified by CLIP, have showcased impressive zero-shot and few-shot generalization capabilities through extensive contrastive training. Existing literature emphasizes prompt learning as a means of enriching prompts with both domain and content information, particularly through smaller learnable projectors, thereby addressing multi-domain data challenges perceptibly. Along with this, it is observed that CLIP’s vision encoder fails to generalize well on the puzzled or corrupted RS images. In response, we propose a novel solution utilizing self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>Lip, trained with small projectors featuring few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of SSL tasks and textual features. Empirical findings demonstrate that RS<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, achieving improvements of at least by 1.3% in domain and class generalization tasks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104254"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Adaptive semantic guidance network for video captioning

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104255

Yuanyuan Liu , Hong Zhu , Zhong Wu , Sen Du , Shuning Wu , Jing Shi

Video captioning aims to describe video content using natural language, and effectively integrating information of visual and textual is crucial for generating accurate captions. However, we find that the existing methods over-rely on the language-prior information about the text acquired by training, resulting in the model tending to output high-frequency fixed phrases. In order to solve the above problems, we extract high-quality semantic information from multi-modal input and then build a semantic guidance mechanism to adapt to the contribution of visual semantics and text semantics to generate captions. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information from visual and textual. The ACD dynamically adjusts the contribution weights of semantics about visual and textual for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome the problem of over-reliance on language priors, resulting in more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets. MSVD and MSR-VTT reached the state-of-the-art, and YouCookII also achieved good performance. These experiments fully verified the advantages of our method.

{"title":"Adaptive semantic guidance network for video captioning","authors":"Yuanyuan Liu , Hong Zhu , Zhong Wu , Sen Du , Shuning Wu , Jing Shi","doi":"10.1016/j.cviu.2024.104255","DOIUrl":"10.1016/j.cviu.2024.104255","url":null,"abstract":"<div><div>Video captioning aims to describe video content using natural language, and effectively integrating information of visual and textual is crucial for generating accurate captions. However, we find that the existing methods over-rely on the language-prior information about the text acquired by training, resulting in the model tending to output high-frequency fixed phrases. In order to solve the above problems, we extract high-quality semantic information from multi-modal input and then build a semantic guidance mechanism to adapt to the contribution of visual semantics and text semantics to generate captions. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information from visual and textual. The ACD dynamically adjusts the contribution weights of semantics about visual and textual for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome the problem of over-reliance on language priors, resulting in more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets. MSVD and MSR-VTT reached the state-of-the-art, and YouCookII also achieved good performance. These experiments fully verified the advantages of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104255"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Mask prior generation with language queries guided networks for referring image segmentation

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-01-29 DOI: 10.1016/j.cviu.2025.104296

Jinhao Zhou , Guoqiang Xiao , Michael S. Lew , Song Wu

The aim of Referring Image Segmentation (RIS) is to generate a pixel-level mask to accurately segment the target object according to its natural language expression. Previous RIS methods ignore exploring the significant language information in both the encoder and decoder stages, and simply use an upsampling-convolution operation to obtain the prediction mask, resulting in inaccurate visual object locating. Thus, this paper proposes a Mask Prior Generation with Language Queries Guided Network (MPG-LQGNet). In the encoder of MPG-LQGNet, a Bidirectional Spatial Alignment Module (BSAM) is designed to realize the bidirectional fusion for both vision and language embeddings, generating additional language queries to understand both the locating of targets and the semantics of the language. Moreover, a Channel Attention Fusion Gate (CAFG) is designed to enhance the exploration of the significance of the cross-modal embeddings. In the decoder of the MPG-LQGNet, the Language Query Guided Mask Prior Generator (LQPG) is designed to utilize the generated language queries to activate significant information in the upsampled decoding features, obtaining the more accurate mask prior that guides the final prediction. Extensive experiments on RefCOCO series datasets show that our method consistently improves over state-of-the-art methods. The source code of our MPG-LQGNet is available at https://github.com/SWU-CS-MediaLab/MPG-LQGNet.

{"title":"Mask prior generation with language queries guided networks for referring image segmentation","authors":"Jinhao Zhou , Guoqiang Xiao , Michael S. Lew , Song Wu","doi":"10.1016/j.cviu.2025.104296","DOIUrl":"10.1016/j.cviu.2025.104296","url":null,"abstract":"<div><div>The aim of Referring Image Segmentation (RIS) is to generate a pixel-level mask to accurately segment the target object according to its natural language expression. Previous RIS methods ignore exploring the significant language information in both the encoder and decoder stages, and simply use an upsampling-convolution operation to obtain the prediction mask, resulting in inaccurate visual object locating. Thus, this paper proposes a Mask Prior Generation with Language Queries Guided Network (MPG-LQGNet). In the encoder of MPG-LQGNet, a Bidirectional Spatial Alignment Module (BSAM) is designed to realize the bidirectional fusion for both vision and language embeddings, generating additional language queries to understand both the locating of targets and the semantics of the language. Moreover, a Channel Attention Fusion Gate (CAFG) is designed to enhance the exploration of the significance of the cross-modal embeddings. In the decoder of the MPG-LQGNet, the Language Query Guided Mask Prior Generator (LQPG) is designed to utilize the generated language queries to activate significant information in the upsampled decoding features, obtaining the more accurate mask prior that guides the final prediction. Extensive experiments on RefCOCO series datasets show that our method consistently improves over state-of-the-art methods. The source code of our MPG-LQGNet is available at <span><span>https://github.com/SWU-CS-MediaLab/MPG-LQGNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"253 ","pages":"Article 104296"},"PeriodicalIF":4.3,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143136341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation UATST：通过跨空间调制实现不配对的任意文本引导样式传输

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-11-20 DOI: 10.1016/j.cviu.2024.104246

Haibo Chen , Lei Zhao

Existing style transfer methods usually utilize style images to represent the target style. Since style images need to be prepared in advance and are confined to existing artworks, these methods are limited in flexibility and creativity. Compared with images, language is a more natural, common, and flexible way for humans to transmit information. Therefore, a better choice is to utilize text descriptions instead of style images to represent the target style. To this end, we propose a novel Unpaired Arbitrary Text-guided Style Transfer (UATST) framework, which can render arbitrary photographs in the style of arbitrary text descriptions with one single model. To the best of our knowledge, this is the first model that achieves Arbitrary-Text-Per-Model with unpaired training data. In detail, we first use a pre-trained VGG network to map the content image into the VGG feature space, and use a pre-trained CLIP text encoder to map the text description into the CLIP feature space. Then we introduce a cross-space modulation module to bridge these two feature spaces, so that the content and style information in two different spaces can be seamlessly and adaptively combined for stylization. In addition, to learn better style representations, we introduce a new CLIP-based style contrastive loss to our model. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method.

现有的风格转移方法通常使用风格图像来表示目标风格。由于风格图像需要提前准备，并且局限于现有的艺术品，因此这些方法的灵活性和创造性有限。与图像相比，语言是人类传递信息的一种更自然、更普遍、更灵活的方式。因此，更好的选择是使用文本描述而不是样式图像来表示目标样式。为此，我们提出了一种新的非配对任意文本引导风格迁移（UATST）框架，该框架可以用一个模型以任意文本描述的风格呈现任意照片。据我们所知，这是第一个用未配对的训练数据实现任意文本模型的模型。具体而言，我们首先使用预训练的VGG网络将内容图像映射到VGG特征空间，并使用预训练的CLIP文本编码器将文本描述映射到CLIP特征空间。然后，我们引入跨空间调制模块来桥接这两个特征空间，使两个不同空间中的内容和样式信息可以无缝地、自适应地组合在一起进行风格化。此外，为了学习更好的风格表示，我们在模型中引入了一个新的基于clip的风格对比损失。大量的定性和定量实验验证了该方法的有效性和优越性。

{"title":"UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation","authors":"Haibo Chen , Lei Zhao","doi":"10.1016/j.cviu.2024.104246","DOIUrl":"10.1016/j.cviu.2024.104246","url":null,"abstract":"<div><div>Existing style transfer methods usually utilize style images to represent the target style. Since style images need to be prepared in advance and are confined to existing artworks, these methods are limited in flexibility and creativity. Compared with images, language is a more natural, common, and flexible way for humans to transmit information. Therefore, a better choice is to utilize text descriptions instead of style images to represent the target style. To this end, we propose a novel <strong>U</strong>npaired <strong>A</strong>rbitrary <strong>T</strong>ext-guided <strong>S</strong>tyle <strong>T</strong>ransfer (<strong>UATST</strong>) framework, which can render arbitrary photographs in the style of arbitrary text descriptions with one single model. To the best of our knowledge, this is the first model that achieves Arbitrary-Text-Per-Model with unpaired training data. In detail, we first use a pre-trained VGG network to map the content image into the VGG feature space, and use a pre-trained CLIP text encoder to map the text description into the CLIP feature space. Then we introduce a cross-space modulation module to bridge these two feature spaces, so that the content and style information in two different spaces can be seamlessly and adaptively combined for stylization. In addition, to learn better style representations, we introduce a new CLIP-based style contrastive loss to our model. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104246"},"PeriodicalIF":4.3,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Computer Vision and Image Understanding

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀