Pruning networks at once via nuclear norm-based regularization and bi-level optimization
Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104247
Donghyeon Lee, Eunho Lee, Jaehyuk Kang, Youngbae Hwang
Most network pruning methods focus on identifying redundant channels from pre-trained models, which is inefficient due to its three-step process: pre-training, pruning and fine-tuning, and reconfiguration. In this paper, we propose a pruning-from-scratch framework that unifies these processes into a single approach. We introduce nuclear norm-based regularization to maintain the representational capacity of large networks during pruning. Combining this with MACs-based regularization enhances the performance of the pruned network at the target compression rate. Our bi-level optimization approach simultaneously improves pruning efficiency and representation capacity. Experimental results show that our method achieves 75.4% accuracy on ImageNet without a pre-trained network, using only 41% of the original model’s computational cost. It also attains 0.5% higher performance in compressing the SSD network for object detection. Furthermore, we analyze the effects of nuclear norm-based regularization.
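To make the two regularizers concrete, here is a minimal PyTorch sketch of how a nuclear norm term over convolutional weights and a MACs-budget penalty could be combined with a task loss; the soft channel gates, function names, and loss weighting are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

def nuclear_norm_reg(model):
    # Sum the nuclear norm (sum of singular values) of each conv kernel,
    # flattened to a 2-D matrix of shape (out_channels, in_channels * k * k).
    # Encouraging a large nuclear norm is one way to preserve representational
    # capacity while channels are being pruned.
    reg = 0.0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            reg = reg + torch.linalg.matrix_norm(m.weight.flatten(1), ord="nuc")
    return reg

def macs_penalty(channel_gates, layer_macs, target_ratio=0.41):
    # Differentiable surrogate for the pruned model's MACs: each layer's cost is
    # scaled by the fraction of channels its soft gate keeps open, and exceeding
    # the target compression ratio is penalized quadratically.
    kept = torch.stack([g.sigmoid().mean() for g in channel_gates])
    macs_ratio = (kept * layer_macs).sum() / layer_macs.sum()
    return torch.clamp(macs_ratio - target_ratio, min=0.0) ** 2

# Illustrative composite objective (signs and weights are assumptions):
# loss = task_loss - lambda_nuc * nuclear_norm_reg(model) \
#        + lambda_macs * macs_penalty(gates, layer_macs)
```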
{"title":"Pruning networks at once via nuclear norm-based regularization and bi-level optimization","authors":"Donghyeon Lee , Eunho Lee , Jaehyuk Kang, Youngbae Hwang","doi":"10.1016/j.cviu.2024.104247","DOIUrl":"10.1016/j.cviu.2024.104247","url":null,"abstract":"<div><div>Most network pruning methods focus on identifying redundant channels from pre-trained models, which is inefficient due to its three-step process: pre-training, pruning and fine-tuning, and reconfiguration. In this paper, we propose a pruning-from-scratch framework that unifies these processes into a single approach. We introduce nuclear norm-based regularization to maintain the representational capacity of large networks during pruning. Combining this with MACs-based regularization enhances the performance of the pruned network at the target compression rate. Our bi-level optimization approach simultaneously improves pruning efficiency and representation capacity. Experimental results show that our method achieves 75.4% accuracy on ImageNet without a pre-trained network, using only 41% of the original model’s computational cost. It also attains 0.5% higher performance in compressing the SSD network for object detection. Furthermore, we analyze the effects of nuclear norm-based regularization.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104247"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adversarial intensity awareness for robust object detection
Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104252
Jikang Cheng, Baojin Huang, Yan Fang, Zhen Han, Zhongyuan Wang
Like other computer vision models, object detectors are vulnerable to adversarial examples (AEs) containing imperceptible perturbations. These AEs can be generated with multiple intensities and then used to attack object detectors in real-world scenarios. One of the most effective ways to improve the robustness of object detectors is adversarial training (AT), which incorporates AEs into the training process. However, while previous AT-based models have shown certain robustness against adversarial attacks of a pre-specified intensity, they still struggle to maintain robustness when defending against adversarial attacks with multiple intensities. To address this issue, we propose a novel robust object detection method based on adversarial intensity awareness. We first explore a potential schema to define the relationship between the neglected intensity information and actual evaluation metrics in AT. Then, we propose the sequential intensity loss (SI Loss) to represent and leverage the neglected intensity information in the AEs. Specifically, SI Loss deploys a sequential adaptive strategy to transform intensity into concrete learnable metrics in a discrete and cumulative manner. Additionally, a boundary smoothing algorithm is introduced to mitigate the influence of particular AEs that are difficult to assign to a specific intensity level. Extensive experiments on the PASCAL VOC and MS-COCO datasets demonstrate the superior performance of our method over other defense methods against multi-intensity adversarial attacks.
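As a rough illustration of intensity-aware adversarial training, the sketch below generates AEs at several fixed intensities and weights their losses in a discrete, cumulative way. It is written for an image classifier for brevity (a detector would substitute its detection loss), and the single-step FGSM attack, the intensity levels, and the weighting scheme are our assumptions rather than the authors' SI Loss.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    # Single-step attack at a given intensity eps (kept simple for illustration).
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()

def multi_intensity_loss(model, x, y, eps_levels=(2 / 255, 4 / 255, 8 / 255)):
    # Clean term plus adversarial terms whose weights grow with the attack
    # intensity, so the model is explicitly exposed to the intensity ordering.
    loss = F.cross_entropy(model(x), y)
    for k, eps in enumerate(eps_levels, start=1):
        x_adv = fgsm_attack(model, x, y, eps)
        loss = loss + (k / len(eps_levels)) * F.cross_entropy(model(x_adv), y)
    return loss
```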
{"title":"Adversarial intensity awareness for robust object detection","authors":"Jikang Cheng, Baojin Huang, Yan Fang, Zhen Han, Zhongyuan Wang","doi":"10.1016/j.cviu.2024.104252","DOIUrl":"10.1016/j.cviu.2024.104252","url":null,"abstract":"<div><div>Like other computer vision models, object detectors are vulnerable to adversarial examples (AEs) containing imperceptible perturbations. These AEs can be generated with multiple intensities and then used to attack object detectors in real-world scenarios. One of the most effective ways to improve the robustness of object detectors is adversarial training (AT), which incorporates AEs into the training process. However, while previous AT-based models have shown certain robustness against adversarial attacks of a pre-specific intensity, they still struggle to maintain robustness when defending against adversarial attacks with multiple intensities. To address this issue, we propose a novel robust object detection method based on adversarial intensity awareness. We first explore potential schema to define the relationship between the neglected intensity information and actual evaluation metrics in AT. Then, we propose the sequential intensity loss (SI Loss) to represent and leverage the neglected intensity information in the AEs. Specifically, SI Loss deploys a sequential adaptive strategy to transform intensity into concrete learnable metrics in a discrete and cumulative manner. Additionally, a boundary smoothing algorithm is introduced to mitigate the influence of some particular AEs that challenging to be divided into a certain intensity level. Extensive experiments on PASCAL VOC and MS-COCO datasets substantially demonstrate the superior performance of our method over other defense methods against multi-intensity adversarial attacks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104252"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint Generating Terminal Correction Imaging method for modular LED integral imaging systems
Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2025.104279
Tianshu Li, Shigang Wang
Integral imaging has garnered significant attention in 3D display technology due to its potential for high-quality visualization. However, elemental images in integral imaging systems usually suffer from misalignment caused by mechanical or human-induced assembly errors within the lens arrays, leading to degraded display quality. This paper introduces a novel Joint-Generating Terminal Correction Imaging (JGTCI) approach tailored to large-scale, modular LED integral imaging systems to address the misalignment between the optical centers of the physical lens arrays and the camera in the generated elemental image arrays. Specifically, we propose: (1) a high-sensitivity calibration marker that enhances alignment precision by accurately matching lens centers to the central points of elemental images; (2) a partitioned calibration strategy that supports independent calibration of display sections, enabling seamless system expansion without recalibrating previously adjusted regions; and (3) a calibration setup in which markers are placed near the lens focal length, ensuring optimal pixel coverage in the camera frame for improved accuracy. Extensive experimental results demonstrate that our JGTCI approach significantly enhances 3D display accuracy, extends the viewing angle, and improves the scalability and practicality of modular integral imaging systems, outperforming recent state-of-the-art methods.
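A toy version of the correction step is sketched below: it estimates the offset between a detected calibration marker and the expected optical center, then shifts the generated elemental image array accordingly. The centroid-based marker detection and integer-pixel shift are simplifications we assume, not the paper's marker design or partitioned strategy.

```python
import cv2
import numpy as np

def marker_offset(frame_bgr, expected_center):
    # Locate a bright calibration marker in the captured camera frame and return
    # its (dx, dy) offset from the expected optical-center position, in pixels.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    m = cv2.moments(mask)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # marker centroid
    return cx - expected_center[0], cy - expected_center[1]

def correct_elemental_images(eia, dx, dy):
    # Shift the generated elemental image array so its centers line up with the
    # measured lens centers (integer-pixel correction for simplicity).
    return np.roll(eia, shift=(round(dy), round(dx)), axis=(0, 1))
```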
{"title":"Joint Generating Terminal Correction Imaging method for modular LED integral imaging systems","authors":"Tianshu Li, Shigang Wang","doi":"10.1016/j.cviu.2025.104279","DOIUrl":"10.1016/j.cviu.2025.104279","url":null,"abstract":"<div><div>Integral imaging has garnered significant attention in 3D display technology due to its potential for high-quality visualization. However, elemental images in integral imaging systems usually suffer from misalignment due to the mechanical or human-induced assembly within the lens arrays, leading to undesirable display quality. This paper introduces a novel Joint-Generating Terminal Correction Imaging (JGTCI) approach tailored for large-scale, modular LED integral imaging systems to address the misalignment between the optical centers of physical lens arrays and the camera in generated elemental image arrays. Specifically, we propose: (1) a high-sensitivity calibration marker to enhance alignment precision by accurately matching lens centers to the central points of elemental images; (2) a partitioned calibration strategy that supports independent calibration of display sections, enabling seamless system expansion without recalibrating previously adjusted regions; and (3) a calibration setup where markers are strategically placed near the lens focal length, ensuring optimal pixel coverage in the camera frame for improved accuracy. Extensive experimental results demonstrate that our JGTCI approach significantly enhances 3D display accuracy, extends the viewing angle, and improves the scalability and practicality of modular integral imaging systems, outperforming recent state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104279"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic scene understanding through advanced object context analysis in image
Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2025.104299
Luis Hernando Ríos González, Sebastián López Flórez, Alfonso González-Briones, Fernando de la Prieta
Advancements in computer vision have primarily concentrated on interpreting visual data, often overlooking the significance of contextual differences across various regions within images. In contrast, our research introduces a model for indoor scene recognition that pivots towards the ‘attention’ paradigm. This model views attention as a response to stimulus image properties, suggesting that focus is ‘pulled’ towards the most visually salient zones within an image, as represented in a saliency map. Attention is directed towards these zones based on uninterpreted semantic features of the image, such as luminance contrast, color, shape, and edge orientation. This neurobiologically plausible and computationally tractable approach offers a more nuanced understanding of scenes by prioritizing zones solely based on their image properties. The proposed model enhances scene understanding through an in-depth analysis of the object context in images. Scene recognition is achieved by extracting features from selected regions of interest within individual image frames using patch-based object detection techniques, generating distinctive feature descriptors for the identified objects of interest. The resulting feature descriptors are then subjected to semantic embedding, which uses distributed representations to transform the sparse feature vectors into dense semantic vectors within a learned latent space. This enables subsequent classification by machine learning models trained on the embedded semantic representations. The model was evaluated on three image datasets: UIUC Sports-8, PASCAL VOC (Visual Object Classes), and a proprietary image set created by the authors. Compared with state-of-the-art methods, our approach provides a more robust abstraction and generalization of interior scenes and achieves superior accuracy, improving scene classification in the selected indoor environments. Our code is published here: https://github.com/sebastianlop8/Semantic-Scene-Object-Context-Analysis.git
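The saliency-driven region selection can be illustrated in a few lines of NumPy/OpenCV. A simple frequency-tuned (Achanta-style) saliency map stands in for the paper's saliency model, and the patch size and number of regions are arbitrary choices of ours.

```python
import cv2
import numpy as np

def frequency_tuned_saliency(img_bgr):
    # Achanta-style saliency: distance of each (blurred) pixel from the mean Lab
    # color of the image; high values mark the visually salient zones.
    lab = cv2.cvtColor(cv2.GaussianBlur(img_bgr, (5, 5), 0),
                       cv2.COLOR_BGR2LAB).astype(np.float32)
    sal = np.linalg.norm(lab - lab.mean(axis=(0, 1)), axis=2)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)

def top_salient_patches(img_bgr, k=5, patch=64):
    # Return the k non-overlapping patches with the highest mean saliency;
    # downstream, each patch would be described and semantically embedded.
    sal = frequency_tuned_saliency(img_bgr)
    h, w = sal.shape
    scored = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            scored.append((sal[y:y + patch, x:x + patch].mean(), y, x))
    scored.sort(reverse=True)
    return [img_bgr[y:y + patch, x:x + patch] for _, y, x in scored[:k]]
```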
{"title":"Semantic scene understanding through advanced object context analysis in image","authors":"Luis Hernando Ríos González , Sebastián López Flórez , Alfonso González-Briones , Fernando de la Prieta","doi":"10.1016/j.cviu.2025.104299","DOIUrl":"10.1016/j.cviu.2025.104299","url":null,"abstract":"<div><div>Advancements in computer vision have primarily concentrated on interpreting visual data, often overlooking the significance of contextual differences across various regions within images. In contrast, our research introduces a model for indoor scene recognition that pivots towards the ‘attention’ paradigm. This model views attention as a response to the stimulus image properties, suggesting that focus is ‘pulled’ towards the most visually salient zones within an image, as represented in a saliency map. Attention is directed towards these zones based on uninterpreted semantic features of the image, such as luminance contrast, color, shape, and edge orientation. This neurobiologically plausible and computationally tractable approach offers a more nuanced understanding of scenes by prioritizing zones solely based on their image properties. The proposed model enhances scene understanding through an in-depth analysis of the object context in images. Scene recognition is achieved by extracting features from selected regions of interest within individual image frames using patch-based object detection techniques, thus generating distinctive feature descriptors for the identified objects of interest. The resulting feature descriptors are then subjected to semantic embedding, which uses distributed representations to transform the sparse feature vectors into dense semantic vectors within a learned latent space. This enables subsequent classification tasks by machine learning models trained on embedded semantic representations. This model was evaluated on three image datasets: UIUC Sports-8, PASCAL VOC - Visual Object Classes, and a proprietary image set created by the authors. Compared to state-of-the-art methods, this paper presents a more robust approach to the abstraction and generalization of interior scenes. This approach has demonstrated superior accuracy with our novel model over existing models. Consequently, this has led to an improvement in the classification of scenes in the selected indoor environments. Our code is published here: <span><span>https://github.com/sebastianlop8/Semantic-Scene-Object-Context-Analysis.git</span><svg><path></path></svg></span></div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104299"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-supervised vision transformers for semantic segmentation
Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104272
Xianfan Gu, Yingdong Hu, Chuan Wen, Yang Gao
Semantic segmentation is a fundamental task in computer vision and a building block of many other vision applications. Nevertheless, semantic segmentation annotations are extremely expensive to collect, so using pre-training to alleviate the need for a large number of labeled samples is appealing. Recently, self-supervised learning (SSL) has shown effectiveness in extracting strong representations and has been widely applied to a variety of downstream tasks. However, most works perform sub-optimally in semantic segmentation because they ignore the specific properties of segmentation: (i) the need for pixel-level, fine-grained understanding; (ii) the assistance of global context understanding; and (iii) achieving both of the above with a dense self-supervisory signal. Based on these key factors, we introduce a systematic self-supervised pre-training framework for semantic segmentation, which consists of a hierarchical encoder–decoder architecture, MEVT, for generating high-resolution features with global contextual information propagation, and a self-supervised training strategy for learning fine-grained semantic features. In our study, our framework shows competitive performance compared with other main self-supervised pre-training methods for semantic segmentation on the COCO-Stuff, ADE20K, PASCAL VOC, and Cityscapes datasets; e.g., MEVT achieves a +1.3 mIoU advantage in linear probing on PASCAL VOC.
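To give a feel for what a dense self-supervisory signal looks like, here is a minimal PyTorch stand-in: a per-pixel consistency loss between spatially aligned features of two augmented views. It is only a simplified proxy for the training strategy described above, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def dense_consistency_loss(feat_a, feat_b):
    # feat_a, feat_b: B x C x H x W feature maps from two augmented views of the
    # same image, assumed spatially aligned. The loss pulls every pixel's
    # feature vector in one view towards its counterpart in the other view.
    a = F.normalize(feat_a.flatten(2), dim=1)  # B x C x HW, unit-norm per pixel
    b = F.normalize(feat_b.flatten(2), dim=1)
    return (2 - 2 * (a * b).sum(dim=1)).mean()  # mean per-pixel cosine distance
```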
{"title":"Self-supervised vision transformers for semantic segmentation","authors":"Xianfan Gu , Yingdong Hu , Chuan Wen , Yang Gao","doi":"10.1016/j.cviu.2024.104272","DOIUrl":"10.1016/j.cviu.2024.104272","url":null,"abstract":"<div><div>Semantic segmentation is a fundamental task in computer vision and it is a building block of many other vision applications. Nevertheless, semantic segmentation annotations are extremely expensive to collect, so using pre-training to alleviate the need for a large number of labeled samples is appealing. Recently, self-supervised learning (SSL) has shown effectiveness in extracting strong representations and has been widely applied to a variety of downstream tasks. However, most works perform sub-optimally in semantic segmentation because they ignore the specific properties of segmentation: (i) the need of pixel level fine-grained understanding; (ii) with the assistance of global context understanding; (iii) both of the above achieve with the dense self-supervisory signal. Based on these key factors, we introduce a systematic self-supervised pre-training framework for semantic segmentation, which consists of a hierarchical encoder–decoder architecture MEVT for generating high-resolution features with global contextual information propagation and a self-supervised training strategy for learning fine-grained semantic features. In our study, our framework shows competitive performance compared with other main self-supervised pre-training methods for semantic segmentation on COCO-Stuff, ADE20K, PASCAL VOC, and Cityscapes datasets. e.g., MEVT achieves the advantage in linear probing by +1.3 mIoU on PASCAL VOC.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104272"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RS³Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP
Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104254
Ankit Jha, Mainak Singha, Avigyan Bhattacharya, Biplab Banerjee
Tackling domain and class generalization challenges remains a significant hurdle in the realm of remote sensing (RS). Recently, large-scale pre-trained vision-language models (VLMs), exemplified by CLIP, have showcased impressive zero-shot and few-shot generalization capabilities through extensive contrastive training. Existing literature emphasizes prompt learning as a means of enriching prompts with both domain and content information, particularly through smaller learnable projectors, thereby addressing multi-domain data challenges. In addition, it is observed that CLIP’s vision encoder fails to generalize well on puzzled or corrupted RS images. In response, we propose a novel solution that utilizes self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS³Lip, trained with small projectors featuring few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of the SSL tasks and the textual features. Empirical findings demonstrate that RS³Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, achieving improvements of at least 1.3% in domain and class generalization tasks.
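The puzzled-image consistency idea can be sketched as follows: patches of each image are shuffled, and a feature-level consistency loss ties the clean and puzzled views together. The 2x2 grid, the cosine objective, and the generic image_encoder call are our illustrative assumptions, not the exact RS³Lip losses.

```python
import torch
import torch.nn.functional as F

def puzzle(images, grid=2):
    # Produce a "puzzled" view by shuffling the grid x grid patches of each image.
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)  # B, C, g, g, ph, pw
    patches = patches.reshape(b, c, grid * grid, ph, pw)
    patches = patches[:, :, torch.randperm(grid * grid, device=images.device)]
    patches = patches.reshape(b, c, grid, grid, ph, pw)
    return patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)

def ssl_consistency(image_encoder, images):
    # Consistency between the visual features of the clean and puzzled views,
    # encouraging the encoder to stay stable under patch corruption.
    f_clean = F.normalize(image_encoder(images), dim=-1)
    f_puzzled = F.normalize(image_encoder(puzzle(images)), dim=-1)
    return (1 - (f_clean * f_puzzled).sum(dim=-1)).mean()
```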
{"title":"RS3Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP","authors":"Ankit Jha , Mainak Singha , Avigyan Bhattacharya , Biplab Banerjee","doi":"10.1016/j.cviu.2024.104254","DOIUrl":"10.1016/j.cviu.2024.104254","url":null,"abstract":"<div><div>Tackling domain and class generalization challenges remains a significant hurdle in the realm of remote sensing (RS). Recently, large-scale pre-trained vision-language models (VLMs), exemplified by CLIP, have showcased impressive zero-shot and few-shot generalization capabilities through extensive contrastive training. Existing literature emphasizes prompt learning as a means of enriching prompts with both domain and content information, particularly through smaller learnable projectors, thereby addressing multi-domain data challenges perceptibly. Along with this, it is observed that CLIP’s vision encoder fails to generalize well on the puzzled or corrupted RS images. In response, we propose a novel solution utilizing self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>Lip, trained with small projectors featuring few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of SSL tasks and textual features. Empirical findings demonstrate that RS<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, achieving improvements of at least by 1.3% in domain and class generalization tasks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104254"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive semantic guidance network for video captioning
Pub Date: 2025-02-01 | DOI: 10.1016/j.cviu.2024.104255
Yuanyuan Liu, Hong Zhu, Zhong Wu, Sen Du, Shuning Wu, Jing Shi
Video captioning aims to describe video content using natural language, and effectively integrating visual and textual information is crucial for generating accurate captions. However, we find that existing methods over-rely on language-prior information acquired during training, so the model tends to output high-frequency fixed phrases. To solve this problem, we extract high-quality semantic information from the multi-modal input and then build a semantic guidance mechanism that adapts the contributions of visual and textual semantics when generating captions. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information in the visual and textual inputs. The ACD dynamically adjusts the contribution weights of visual and textual semantics for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome over-reliance on language priors, resulting in more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets: results on MSVD and MSR-VTT reach the state of the art, and YouCookII also achieves good performance. These experiments fully verify the advantages of our method.
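A minimal sketch of the adaptive weighting idea is given below: a learned sigmoid gate decides, per decoding step, how much the visual and textual semantic vectors each contribute. The module name, dimensions, and fusion form are assumptions for illustration, not the exact ACD design.

```python
import torch
import torch.nn as nn

class AdaptiveSemanticGate(nn.Module):
    # Fuses a visual semantic vector and a textual semantic vector with a
    # learned, input-dependent gate in [0, 1].
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis_sem, txt_sem):
        # vis_sem, txt_sem: B x dim semantic vectors for the current word step.
        g = self.gate(torch.cat([vis_sem, txt_sem], dim=-1))
        return g * vis_sem + (1 - g) * txt_sem  # adaptively weighted semantics
```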
{"title":"Adaptive semantic guidance network for video captioning","authors":"Yuanyuan Liu , Hong Zhu , Zhong Wu , Sen Du , Shuning Wu , Jing Shi","doi":"10.1016/j.cviu.2024.104255","DOIUrl":"10.1016/j.cviu.2024.104255","url":null,"abstract":"<div><div>Video captioning aims to describe video content using natural language, and effectively integrating information of visual and textual is crucial for generating accurate captions. However, we find that the existing methods over-rely on the language-prior information about the text acquired by training, resulting in the model tending to output high-frequency fixed phrases. In order to solve the above problems, we extract high-quality semantic information from multi-modal input and then build a semantic guidance mechanism to adapt to the contribution of visual semantics and text semantics to generate captions. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information from visual and textual. The ACD dynamically adjusts the contribution weights of semantics about visual and textual for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome the problem of over-reliance on language priors, resulting in more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets. MSVD and MSR-VTT reached the state-of-the-art, and YouCookII also achieved good performance. These experiments fully verified the advantages of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104255"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mask prior generation with language queries guided networks for referring image segmentation
Pub Date: 2025-01-29 | DOI: 10.1016/j.cviu.2025.104296
Jinhao Zhou, Guoqiang Xiao, Michael S. Lew, Song Wu
The aim of Referring Image Segmentation (RIS) is to generate a pixel-level mask that accurately segments the target object according to its natural language expression. Previous RIS methods neglect to explore the significant language information in both the encoder and decoder stages, and simply use an upsampling-convolution operation to obtain the prediction mask, resulting in inaccurate localization of the target object. Thus, this paper proposes a Mask Prior Generation with Language Queries Guided Network (MPG-LQGNet). In the encoder of MPG-LQGNet, a Bidirectional Spatial Alignment Module (BSAM) is designed to realize the bidirectional fusion of vision and language embeddings, generating additional language queries that capture both the location of targets and the semantics of the language. Moreover, a Channel Attention Fusion Gate (CAFG) is designed to enhance the exploration of the significance of the cross-modal embeddings. In the decoder of MPG-LQGNet, the Language Query Guided Mask Prior Generator (LQPG) is designed to utilize the generated language queries to activate significant information in the upsampled decoding features, obtaining a more accurate mask prior that guides the final prediction. Extensive experiments on the RefCOCO series of datasets show that our method consistently improves over state-of-the-art methods. The source code of our MPG-LQGNet is available at https://github.com/SWU-CS-MediaLab/MPG-LQGNet.
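The sketch below shows one way language queries could activate upsampled decoder features into a mask prior, using standard cross-attention followed by a single-channel projection. The head count, query count, and layer names are hypothetical and only meant to illustrate the mechanism.

```python
import torch
import torch.nn as nn

class LanguageQueryMaskPrior(nn.Module):
    # Cross-attends decoder pixel features to a small set of language queries
    # and projects the attended response into a single-channel mask prior.
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_mask = nn.Linear(dim, 1)

    def forward(self, pixel_feats, language_queries):
        # pixel_feats: B x HW x C upsampled decoder features (flattened spatially);
        # language_queries: B x Nq x C queries produced alongside the encoder.
        attended, _ = self.attn(pixel_feats, language_queries, language_queries)
        return self.to_mask(attended).sigmoid()  # B x HW x 1, reshaped to H x W later
```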
{"title":"Mask prior generation with language queries guided networks for referring image segmentation","authors":"Jinhao Zhou , Guoqiang Xiao , Michael S. Lew , Song Wu","doi":"10.1016/j.cviu.2025.104296","DOIUrl":"10.1016/j.cviu.2025.104296","url":null,"abstract":"<div><div>The aim of Referring Image Segmentation (RIS) is to generate a pixel-level mask to accurately segment the target object according to its natural language expression. Previous RIS methods ignore exploring the significant language information in both the encoder and decoder stages, and simply use an upsampling-convolution operation to obtain the prediction mask, resulting in inaccurate visual object locating. Thus, this paper proposes a Mask Prior Generation with Language Queries Guided Network (MPG-LQGNet). In the encoder of MPG-LQGNet, a Bidirectional Spatial Alignment Module (BSAM) is designed to realize the bidirectional fusion for both vision and language embeddings, generating additional language queries to understand both the locating of targets and the semantics of the language. Moreover, a Channel Attention Fusion Gate (CAFG) is designed to enhance the exploration of the significance of the cross-modal embeddings. In the decoder of the MPG-LQGNet, the Language Query Guided Mask Prior Generator (LQPG) is designed to utilize the generated language queries to activate significant information in the upsampled decoding features, obtaining the more accurate mask prior that guides the final prediction. Extensive experiments on RefCOCO series datasets show that our method consistently improves over state-of-the-art methods. The source code of our MPG-LQGNet is available at <span><span>https://github.com/SWU-CS-MediaLab/MPG-LQGNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"253 ","pages":"Article 104296"},"PeriodicalIF":4.3,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143136341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation
Pub Date: 2024-11-20 | DOI: 10.1016/j.cviu.2024.104246
Haibo Chen, Lei Zhao
Existing style transfer methods usually utilize style images to represent the target style. Since style images need to be prepared in advance and are confined to existing artworks, these methods are limited in flexibility and creativity. Compared with images, language is a more natural, common, and flexible way for humans to transmit information. Therefore, a better choice is to utilize text descriptions instead of style images to represent the target style. To this end, we propose a novel Unpaired Arbitrary Text-guided Style Transfer (UATST) framework, which can render arbitrary photographs in the style of arbitrary text descriptions with one single model. To the best of our knowledge, this is the first model that achieves Arbitrary-Text-Per-Model with unpaired training data. In detail, we first use a pre-trained VGG network to map the content image into the VGG feature space, and use a pre-trained CLIP text encoder to map the text description into the CLIP feature space. Then we introduce a cross-space modulation module to bridge these two feature spaces, so that the content and style information in two different spaces can be seamlessly and adaptively combined for stylization. In addition, to learn better style representations, we introduce a new CLIP-based style contrastive loss to our model. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method.
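To illustrate what a cross-space modulation between the CLIP text space and the VGG feature space might look like, here is an AdaIN-style sketch in which the text embedding predicts per-channel scale and shift for the normalized content features; the dimensions and layer choices are our assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class CrossSpaceModulation(nn.Module):
    # Maps a CLIP text embedding to per-channel scale/shift parameters that
    # modulate instance-normalized VGG content features.
    def __init__(self, clip_dim=512, vgg_channels=512):
        super().__init__()
        self.to_gamma = nn.Linear(clip_dim, vgg_channels)
        self.to_beta = nn.Linear(clip_dim, vgg_channels)

    def forward(self, content_feat, text_emb):
        # content_feat: B x C x H x W VGG features; text_emb: B x clip_dim.
        mu = content_feat.mean(dim=(2, 3), keepdim=True)
        std = content_feat.std(dim=(2, 3), keepdim=True) + 1e-6
        gamma = self.to_gamma(text_emb)[:, :, None, None]
        beta = self.to_beta(text_emb)[:, :, None, None]
        return gamma * (content_feat - mu) / std + beta
```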
{"title":"UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation","authors":"Haibo Chen , Lei Zhao","doi":"10.1016/j.cviu.2024.104246","DOIUrl":"10.1016/j.cviu.2024.104246","url":null,"abstract":"<div><div>Existing style transfer methods usually utilize style images to represent the target style. Since style images need to be prepared in advance and are confined to existing artworks, these methods are limited in flexibility and creativity. Compared with images, language is a more natural, common, and flexible way for humans to transmit information. Therefore, a better choice is to utilize text descriptions instead of style images to represent the target style. To this end, we propose a novel <strong>U</strong>npaired <strong>A</strong>rbitrary <strong>T</strong>ext-guided <strong>S</strong>tyle <strong>T</strong>ransfer (<strong>UATST</strong>) framework, which can render arbitrary photographs in the style of arbitrary text descriptions with one single model. To the best of our knowledge, this is the first model that achieves Arbitrary-Text-Per-Model with unpaired training data. In detail, we first use a pre-trained VGG network to map the content image into the VGG feature space, and use a pre-trained CLIP text encoder to map the text description into the CLIP feature space. Then we introduce a cross-space modulation module to bridge these two feature spaces, so that the content and style information in two different spaces can be seamlessly and adaptively combined for stylization. In addition, to learn better style representations, we introduce a new CLIP-based style contrastive loss to our model. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104246"},"PeriodicalIF":4.3,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}