
Computer Vision and Image Understanding: Latest Publications

TI-PREGO: Chain of Thought and In-Context Learning for online mistake detection in PRocedural EGOcentric videos
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-29 | DOI: 10.1016/j.cviu.2025.104613
Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, no existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem in an online fashion: the recognition branch takes input frames from egocentric video, predicts the current action and aggregates frame-level results into action tokens while the anticipation branch leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module.
Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.
Citations: 0
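The core of the approach is the mismatch rule between the two branches: a mistake is flagged when the recognized action disagrees with the anticipated one. Below is a minimal sketch of that rule with placeholder `recognize` and `anticipate` functions; it is an illustration of the idea only, not the authors' implementation, which uses an LLM with Chain of Thought and In-Context Learning for anticipation.

```python
# Minimal sketch of the dual-branch mismatch rule for online mistake detection.
# recognize() and anticipate() are placeholder stand-ins, not the paper's branches.
from typing import List


def recognize(frame_labels: List[str]) -> str:
    """Stand-in recognition branch: aggregate frame-level labels into one
    action token (here, a simple majority vote over the current window)."""
    return max(set(frame_labels), key=frame_labels.count)


def anticipate(history: List[str]) -> str:
    """Stand-in anticipation branch: predict the next action token from the
    previously predicted ones (here, repeat the transition last seen after
    the current token; the paper instead queries an LLM)."""
    if not history:
        return "unknown"
    current = history[-1]
    for prev, nxt in zip(reversed(history[:-1]), reversed(history[1:])):
        if prev == current:
            return nxt
    return current


def is_mistake(history: List[str], frame_labels: List[str]) -> bool:
    """Online mistake detection: mismatch between the two branches."""
    return recognize(frame_labels) != anticipate(history)


# Example: action tokens predicted so far, plus frame-level labels for the new step
print(is_mistake(["pour", "stir", "pour"], ["stir", "stir", "pour"]))
```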
Multimodal transformer–diffusion framework for large-scale reconstruction of soccer tracking data
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-29 | DOI: 10.1016/j.cviu.2025.104626
Harry Hughes, Patrick Lucey, Michael Horton, Harshala Gammulle, Clinton Fookes, Sridha Sridharan
In soccer, tracking data (player and ball locations over time) is central to performance analysis and a major focus of computer vision in sport. Tracking from broadcast or single-view video offers scalable coverage across all professional matches but suffers from frequent occlusions and missing information. Existing academic work typically evaluates short clips under simplified conditions, whereas industrial applications require complete, game-level coverage. We address these challenges with a multimodal transformer–diffusion framework that combines human-in-the-loop event supervision with single-view video. Our approach first leverages long-term multimodal context — tracking and event annotations — to improve coarse agent localization, then reconstructs full trajectories using a diffusion-based generative model that produces realistic, temporally coherent motion. Compared to state-of-the-art methods, our approach substantially improves both coarse and fine-grained accuracy while scaling effectively to industrial settings. By integrating human supervision with multimodal generative modeling, we provide a robust and practical solution for producing accurate and realistic player and ball trajectories under challenging real-world single-view conditions.
Citations: 0
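A diffusion model that reconstructs full trajectories from coarse, occlusion-ridden estimates is typically run as a reverse denoising loop conditioned on the coarse input. The sketch below shows that generic DDPM-style sampling pattern; the denoiser `eps_model`, the noise schedule, and the conditioning format are all assumptions, not the paper's architecture.

```python
# Generic sketch of diffusion-based trajectory refinement, assuming a trained
# denoiser eps_model(noisy_traj, t, conditioning) exists (hypothetical).
import torch

T_STEPS = 50
betas = torch.linspace(1e-4, 2e-2, T_STEPS)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


@torch.no_grad()
def refine_trajectories(eps_model, coarse_traj, context):
    """coarse_traj: (agents, frames, 2) rough positions used as conditioning.
    Starts from noise and denoises toward trajectories consistent with it."""
    x = torch.randn_like(coarse_traj)
    for t in reversed(range(T_STEPS)):
        cond = torch.cat([coarse_traj, context], dim=-1)   # assumed conditioning
        eps = eps_model(x, t, cond)                        # predicted noise
        a, ab = alphas[t], alpha_bars[t]
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```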
Boundary-aware semantic segmentation for ice hockey rink registration
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-27 | DOI: 10.1016/j.cviu.2025.104627
Zhibo Wang, Amir Nazemi, Stephie Liu, Sirisha Rambhatla, Yuhao Chen, David Clausi
Accurate registration of ice hockey rinks from broadcast video frames is fundamental to sports analytics, as it aligns the rink template and broadcast frame into a unified coordinate system for consistent player analysis. Existing approaches, including keypoint- and segmentation-based methods, often yield suboptimal homography estimation due to insufficient attention to rink boundaries. To address this, we propose a segmentation-based framework that explicitly introduces the rink boundary as a new segmentation class. To further improve accuracy, we introduce three components that enhance boundary awareness: (i) a boundary-aware loss to strengthen boundary representation, (ii) a dynamic class-weighted mechanism in homography estimation to emphasize informative regions, and (iii) a self-distillation strategy to enrich feature diversity. Experiments on the NHL and SHL datasets demonstrate that our method significantly outperforms both baselines, achieving improvements of +2.84 and +3.48 in IoU_part and IoU_whole on the NHL dataset, and +1.53 and +5.85 on the SHL dataset, respectively. Ablation studies further confirm the contribution of each component, establishing a robust solution for rink registration and a strong foundation for downstream sports vision tasks.
Citations: 0
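Treating the rink boundary as an explicit, up-weighted segmentation class can be expressed as a weighted cross-entropy. The snippet below is a generic sketch of that idea; the class layout, boundary index, and weight value are assumptions, and the paper's boundary-aware loss may differ in form.

```python
# Sketch of a segmentation loss that adds the rink boundary as its own class
# and up-weights it. Class indices and the weight factor are assumed values.
import torch
import torch.nn.functional as F

NUM_CLASSES = 5          # e.g. background, ice, lines, circles, boundary (assumed)
BOUNDARY_CLASS = 4       # assumed index of the added boundary class
BOUNDARY_WEIGHT = 4.0    # assumed up-weighting factor


def boundary_aware_ce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with extra penalty on boundary pixels.

    logits: (B, C, H, W) raw scores; target: (B, H, W) class indices.
    """
    weights = torch.ones(NUM_CLASSES, device=logits.device)
    weights[BOUNDARY_CLASS] = BOUNDARY_WEIGHT
    return F.cross_entropy(logits, target, weight=weights)


# Example usage with random tensors
logits = torch.randn(2, NUM_CLASSES, 64, 64)
target = torch.randint(0, NUM_CLASSES, (2, 64, 64))
loss = boundary_aware_ce(logits, target)
```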
Memory-enriched thought-by-thought framework for complex Diagram Question Answering
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-26 | DOI: 10.1016/j.cviu.2025.104608
Xinyu Zhang, Lingling Zhang, Yanrui Wu, Shaowei Wang, Wenjun Wu, Muye Huang, Qianying Wang, Jun Liu
Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs often fusing visual and textual information in a single step, lacking the capture and representation of key information during the reasoning process, ignoring critical changes in the reasoning process, and failing to reflect the complex and dynamic nature of human-like reasoning. To address these issues, we propose a new framework called Memory-Enriched Thought-by-Thought (METbT), which incorporates memory and operators. On the one hand, the memory is used to store intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop the METbT-Bert, METbT-T5, METbT-Qwen and METbT-InternLM, leveraging Bert, T5, Qwen and InternLM as the foundational language models with our framework, respectively. Experiments are conducted on multiple datasets including Smart-101, ScienceQA, and IconQA, and in all cases, the results surpassed those of the same language models. The results demonstrate that our METbT framework offers superior scalability and robustness.
Citations: 0
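The abstract describes two ingredients: a memory that stores intermediate reasoning representations and operators that merge visual and textual features. The toy loop below illustrates how a step-by-step, memory-backed fusion might look; the operator set, dimensions, and gating are assumptions, not the METbT modules.

```python
# Toy sketch of "memory + operators": each reasoning step fuses the current
# state with a textual thought via an operator and appends the result to a
# memory, instead of fusing everything in a single step. Shapes are assumed.
import torch
import torch.nn as nn


class FusionOperators(nn.Module):
    """Two toy operators for merging visual and textual features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def add(self, v, t):
        return v + t

    def gated(self, v, t):
        g = torch.sigmoid(self.gate(torch.cat([v, t], dim=-1)))
        return g * v + (1 - g) * t


def reason(visual, text_steps, ops: FusionOperators):
    """Thought-by-thought loop: keep every intermediate fused state in memory."""
    memory = []
    state = visual
    for t_emb in text_steps:
        state = ops.gated(state, t_emb)   # fuse the current thought with the state
        memory.append(state)              # store the intermediate representation
    return memory[-1], memory


dim = 256
ops = FusionOperators(dim)
final_state, mem = reason(torch.randn(dim), [torch.randn(dim) for _ in range(3)], ops)
```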
SASTD: Stepwise attention style transfer network based on diffusion models
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-23 | DOI: 10.1016/j.cviu.2025.104612
Zhuoya Wang, Gui Chen, Yaxin Li, Yongsheng Dong
Image style transfer techniques have significantly advanced, aiming to create images that adopt the style attributes of one source while maintaining the spatial layout of another. However, the interrelationship between style and content often causes the problem of information entanglement within the generated stylized result. To alleviate this issue, in this paper we propose a stepwise attention style transfer network based on diffusion models (SASTD). Specifically, we introduce an attention feature extraction and fusion module, which employs a step-by-step injection method to effectively combine the extracted content and style attention features at different time stages. Additionally, we propose a noise initialization module based on adaptive instance normalization (AdaIN) in the early fusion stage to initialize the initial latent noise during image generation, preserving certain initial feature statistics. Furthermore, we incorporate edge attention from the content image to enhance the preservation of its structural details. Finally, we propose a LAB space alignment module to further optimize the initially generated stylized image. This method ensures high-quality style transfer while better maintaining the spatial semantics of the content image. Experimental results demonstrate that our proposed SASTD achieves better performance in both qualitative and quantitative comparisons compared to both image style transfer methods and style-guided text-to-image synthesis methods.
Citations: 0
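The noise-initialization module is described as AdaIN-based, i.e. the initial latent inherits the channel-wise statistics of a style reference. Below is the standard AdaIN computation applied to latents as a sketch of that step; the tensor shapes and the way it plugs into the diffusion pipeline are assumptions.

```python
# Standard AdaIN applied to latent tensors, as a sketch of statistics-aligned
# noise initialization before diffusion sampling. Shapes are assumed.
import torch


def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5):
    """content, style: (B, C, H, W). Returns content re-normalized to carry
    the per-channel mean/std of the style tensor."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean


# Initialize the latent noise from content noise aligned to style statistics
content_latent = torch.randn(1, 4, 64, 64)
style_latent = torch.randn(1, 4, 64, 64)
init_latent = adain(content_latent, style_latent)
```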
RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-23 | DOI: 10.1016/j.cviu.2025.104621
Edgar R. Guzman, Letizia Gionfrida, Robert D. Howe
This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of 5.77 cm, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of 1.20 ± 0.49 cm in height and 1.35 ± 0.45 cm in depth for ascending stairs, and 1.28 ± 0.55 cm in height and 1.47 ± 0.65 cm in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.
Citations: 0
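Estimating tread depth and riser height from a stair point cloud generally reduces to grouping points into horizontal levels and measuring the gaps and extents between them. The sketch below shows one such baseline; the coordinate convention, clustering threshold, and simple level-splitting heuristic are assumptions and not the algorithm evaluated in the paper.

```python
# Baseline sketch: estimate riser height and tread depth by clustering stair
# points into height levels. Cloud format and thresholds are assumed.
import numpy as np


def stair_dimensions(points: np.ndarray, level_gap: float = 0.08):
    """points: (N, 3) array with x = forward, y = lateral, z = up (metres).

    Groups points into horizontal tread levels, then reads riser height as the
    mean gap between consecutive levels and tread depth as the mean forward
    extent of each level.
    """
    z = np.sort(points[:, 2])
    # Split sorted heights wherever the jump exceeds the assumed level gap
    splits = np.where(np.diff(z) > level_gap)[0] + 1
    levels = np.split(z, splits)
    level_heights = np.array([lvl.mean() for lvl in levels])

    tread_depths = []
    for h in level_heights:
        mask = np.abs(points[:, 2] - h) < level_gap / 2
        if mask.any():
            x = points[mask, 0]
            tread_depths.append(x.max() - x.min())

    riser = float(np.diff(level_heights).mean()) if len(level_heights) > 1 else 0.0
    tread = float(np.mean(tread_depths)) if tread_depths else 0.0
    return riser, tread
```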
SynTaskNet: A synergistic multi-task network for joint segmentation and classification of small anatomical structures in ultrasound imaging
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104616
Abdulrhman H. Al-Jebrni, Saba Ghazanfar Ali, Bin Sheng, Huating Li, Xiao Lin, Ping Li, Younhyun Jung, Jinman Kim, Li Xu, Lixin Jiang, Jing Du
Segmenting small, low-contrast anatomical structures and classifying their pathological status in ultrasound (US) images remain challenging tasks in computer vision, especially under the noise and ambiguity inherent in real-world clinical data. Papillary thyroid microcarcinoma (PTMC), characterized by nodules ≤ 1.0 cm, exemplifies these challenges where both precise segmentation and accurate lymph node metastasis (LNM) prediction are essential for informed clinical decisions. We propose SynTaskNet, a synergistic multi-task learning (MTL) architecture that jointly performs PTMC nodule segmentation and LNM classification from US images. Built upon a DenseNet201 backbone, SynTaskNet incorporates several specialized modules: a Coordinated Depth-wise Convolution (CDC) layer for enhancing spatial features, an Adaptive Context Block (ACB) for embedding contextual dependencies, and a Multi-scale Contextual Boundary Attention (MCBA) module to improve boundary localization in low-contrast regions. To strengthen task interaction, we introduce a Selective Enhancement Fusion (SEF) mechanism that hierarchically integrates features across three semantic levels, enabling effective information exchange between segmentation and classification branches. On top of this, we formulate a synergistic learning scheme wherein an Auxiliary Segmentation Map (ASM) generated by the segmentation decoder is injected into SEF’s third class-specific fusion path to guide LNM classification. In parallel, the predicted LNM label is concatenated with the third-path SEF output to refine the Final Segmentation Map (FSM), enabling bidirectional task reinforcement. Extensive evaluations on a dedicated PTMC US dataset demonstrate that SynTaskNet achieves state-of-the-art performance, with a Dice score of 93.0% for segmentation and a classification accuracy of 94.2% for LNM prediction, validating its clinical relevance and technical efficacy.
Citations: 0
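The bidirectional coupling described above (an auxiliary segmentation map conditioning the classifier, and the predicted label conditioning the final segmentation head) can be illustrated with a tiny two-head network. Layer sizes, pooling, and the label broadcast are assumptions; this is not the SynTaskNet architecture with its CDC/ACB/MCBA/SEF modules.

```python
# Toy two-head network showing segmentation->classification->segmentation
# feedback. All layer sizes and the fusion scheme are assumed.
import torch
import torch.nn as nn


class TinyMultiTask(nn.Module):
    def __init__(self, feat_ch: int = 32, num_classes: int = 2):
        super().__init__()
        self.backbone = nn.Conv2d(1, feat_ch, 3, padding=1)
        self.aux_seg = nn.Conv2d(feat_ch, 1, 1)                  # auxiliary map (ASM-like)
        self.cls_head = nn.Linear(feat_ch + 1, num_classes)      # classifier sees the map
        self.final_seg = nn.Conv2d(feat_ch + num_classes, 1, 1)  # final map sees the label

    def forward(self, x):
        feat = torch.relu(self.backbone(x))                       # (B, C, H, W)
        asm = torch.sigmoid(self.aux_seg(feat))                   # (B, 1, H, W)
        pooled = torch.cat([feat, asm], dim=1).mean(dim=(2, 3))   # (B, C+1)
        logits = self.cls_head(pooled)                             # (B, K)
        label_map = logits.softmax(-1)[..., None, None].expand(
            -1, -1, feat.shape[2], feat.shape[3])                  # broadcast label over space
        fsm = torch.sigmoid(self.final_seg(torch.cat([feat, label_map], dim=1)))
        return asm, logits, fsm


model = TinyMultiTask()
asm, logits, fsm = model(torch.randn(2, 1, 64, 64))
```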
Label-informed knowledge integration: Advancing visual prompt for VLMs adaptation
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104614
Yue Wu, Yunhong Wang, Guodong Wang, Jinjin Zhang, Yingjie Gao, Xiuguo Bao, Di Huang
Prompt tuning has emerged as a pivotal technique for adapting pre-trained vision-language models (VLMs) to a wide range of downstream tasks. Recent developments have introduced multimodal learnable prompts to construct task-specific classifiers. However, these methods often exhibit limited generalization to unseen classes, primarily due to fixed prompt designs that are tightly coupled with seen training data and lack adaptability to novel class distributions. To overcome this limitation, we propose Label-Informed Knowledge Integration (LIKI)—a novel framework that harnesses the robust generalizability of textual label semantics to guide the generation of adaptive visual prompts. Rather than directly mapping textual prompts into the visual domain, LIKI utilizes robust text embeddings as a knowledge source to inform the visual prompt optimization. Central to our method is a simple yet effective Label Semantic Integration (LSI) module, which dynamically incorporates knowledge from both seen and unseen labels into the visual prompts. This label-informed prompting strategy imbues the visual encoder with semantic awareness, thereby enhancing the generalization and discriminative capacity of VLMs across diverse scenarios. Extensive experiments demonstrate that LIKI consistently outperforms state-of-the-art approaches in base-to-novel generalization, cross-dataset transfer, and domain generalization tasks, offering a significant advancement in prompt-based VLM adaptation.
Citations: 0
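The key idea is that frozen text embeddings of class labels act as a knowledge source that shapes learnable visual prompts. A rough sketch using cross-attention from prompt tokens to label embeddings is shown below; the dimensions, number of prompts, and attention setup are assumptions rather than the LSI module itself.

```python
# Sketch of label-informed visual prompting: learnable prompt tokens attend
# over frozen label text embeddings, then get prepended to the patch tokens.
# Dimensions and module layout are assumed.
import torch
import torch.nn as nn


class LabelInformedPrompt(nn.Module):
    def __init__(self, dim: int = 512, n_prompts: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, label_embs: torch.Tensor, patch_tokens: torch.Tensor):
        """label_embs: (B, L, D) frozen text embeddings of seen and unseen labels.
        patch_tokens: (B, N, D) visual tokens from the image encoder."""
        q = self.prompts.unsqueeze(0).expand(label_embs.size(0), -1, -1)
        informed, _ = self.attn(q, label_embs, label_embs)   # inject label semantics
        return torch.cat([informed, patch_tokens], dim=1)    # prepend informed prompts


module = LabelInformedPrompt()
tokens = module(torch.randn(2, 10, 512), torch.randn(2, 196, 512))
```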
SCAFNet: Multimodal stroke medical image synthesis and fusion network based on self attention and cross attention
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104611
Yu Zhu, Liqiang Song, Junli Zhao, Guodong Wang, Hui Li, Yi Li
Early diagnosis and intervention are critical in managing acute ischemic stroke to effectively reduce morbidity and mortality. Medical image synthesis generates multimodal images from unimodal inputs, while image fusion integrates complementary information across modalities. However, current approaches typically address these tasks separately, neglecting their inherent synergies and the potential for a richer, more comprehensive diagnostic picture. To overcome this, we propose a two-stage deep learning(DL) framework for improved lesion analysis in ischemic stroke, which combines medical image synthesis and fusion to improve diagnostic informativeness. In the first stage, a Generative Adversarial Network (GAN)-based method, pix2pixHD, efficiently synthesizes high-fidelity multimodal medical images from unimodal inputs, thereby enriching the available diagnostic data for subsequent processing. The second stage introduces a multimodal medical image fusion network, SCAFNet, leveraging self-attention and cross-attention mechanisms. SCAFNet captures intra-modal feature relationships via self-attention to emphasize key information within each modality, and constructs inter-modal feature interactions via cross-attention to fully exploit their complementarity. Additionally, an Information Assistance Module (IAM) is introduced to facilitate the extraction of more meaningful information and improve the visual quality of fused images. Experimental results demonstrate that the proposed framework significantly outperforms existing methods in both generated and fused image quality, highlighting its substantial potential for clinical applications in medical image analysis.
Citations: 0
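The fusion pattern in the abstract (self-attention within each modality, then cross-attention across modalities) is sketched below with standard attention layers; the block layout, residual wiring, and sizes are assumptions and omit the IAM and the pix2pixHD synthesis stage.

```python
# Sketch of self-attention + cross-attention fusion for two modalities.
# Layer sizes and residual connections are assumed, not the SCAFNet design.
import torch
import torch.nn as nn


class SelfCrossFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        """feat_a, feat_b: (B, N, D) token features from two imaging modalities."""
        a, _ = self.self_a(feat_a, feat_a, feat_a)   # intra-modal emphasis
        b, _ = self.self_b(feat_b, feat_b, feat_b)
        fused, _ = self.cross(a, b, b)               # modality A queries modality B
        return fused + a                              # residual fusion


fusion = SelfCrossFusion()
out = fusion(torch.randn(1, 64, 256), torch.randn(1, 64, 256))
```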
A dynamic hybrid network with attention and mamba for image captioning
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104617
Lulu Wang, Ruiji Xue, Zhengtao Yu, Ruoyu Zhang, Tongling Pan, Yingna Li
Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes emphasis on long-range dependencies. Such remote elements, nevertheless, play a critical role in yielding captions of superior quality. Therefore, we pursue strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, and ultimately propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) that integrates the attention mechanisms with the mamba blocks, which further complements the attention by leveraging mamba’s superior long-sequence modeling capabilities and enables a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to evolving decoding contexts, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. The implementation code is accessible via https://github.com/simple-boy/DH-Net.
Citations: 0
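The Feature Aggregation Module is described as dynamically adapting multimodal fusion to the evolving decoding context. A minimal stand-in for that behavior is a decoder-state-conditioned gate over feature sources, as sketched below; the name reuse, shapes, and two-source setup are assumptions, not the published FAM.

```python
# Stand-in for dynamic feature aggregation: the current decoder state produces
# softmax weights that mix pooled grid features and global features. Assumed shapes.
import torch
import torch.nn as nn


class DynamicAggregation(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(dim, 2)   # one weight per feature source

    def forward(self, decoder_state, grid_feat, global_feat):
        """decoder_state: (B, D); grid_feat, global_feat: (B, D) pooled features."""
        w = torch.softmax(self.gate(decoder_state), dim=-1)       # (B, 2)
        return w[:, :1] * grid_feat + w[:, 1:] * global_feat      # context-aware mix


fam = DynamicAggregation()
mixed = fam(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```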