Computer Vision and Image Understanding: latest articles (page 2)

YES: You should Examine Suspect cues for low-light object detection

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104271

Shu Ye, Wenxin Huang, Wenxuan Liu, Liang Chen, Xiao Wang, Xian Zhong

Object detection in low-light conditions presents substantial challenges, particularly the issue we define as “low-light object-background cheating”. This phenomenon arises from uneven lighting, leading to blurred and inaccurate object edges. Most existing methods focus on basic feature enhancement and on addressing the gap between normal-light and synthetic low-light conditions. However, they often overlook the complexities introduced by uneven lighting in real-world environments. To address this, we propose a novel low-light object detection framework, You Examine Suspect (YES), comprising two key components: the Optical Balance Enhancer (OBE) and the Entanglement Attenuation Module (EAM). The OBE emphasizes “balance” by employing techniques such as inverse tone mapping, white balance, and gamma correction to recover details in dark regions while adjusting brightness and contrast without introducing noise. The EAM focuses on “disentanglement” by analyzing both object regions and surrounding areas affected by lighting variations and integrating multi-scale contextual information to clarify ambiguous features. Extensive experiments on the ExDark and Dark Face datasets demonstrate the superior performance of the proposed YES, validating its effectiveness in low-light object detection tasks. The code will be available at https://github.com/Regina971/YES.
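Two of the operations the abstract names for the OBE, white balance and gamma correction, can be illustrated with a minimal sketch. This is not the authors' OBE: the function names, the gray-world balancing rule, the synthetic image, and the choice of gamma = 0.5 are all assumptions made for illustration.

```python
import numpy as np


def gamma_correct(image: np.ndarray, gamma: float) -> np.ndarray:
    """Apply power-law (gamma) correction to a float image in [0, 1].

    gamma < 1 brightens dark regions; gamma > 1 darkens them.
    """
    return np.clip(image, 0.0, 1.0) ** gamma


def gray_world_white_balance(image: np.ndarray) -> np.ndarray:
    """Scale each channel so that all channel means match their common mean."""
    channel_means = image.reshape(-1, image.shape[-1]).mean(axis=0)
    gains = channel_means.mean() / np.maximum(channel_means, 1e-8)
    return np.clip(image * gains, 0.0, 1.0)


# A tiny synthetic "low-light" RGB image with a blue colour cast (illustrative data).
rng = np.random.default_rng(0)
dark = rng.uniform(0.0, 0.2, size=(4, 4, 3)) * np.array([0.6, 0.8, 1.0])

balanced = gray_world_white_balance(dark)      # remove the colour cast
enhanced = gamma_correct(balanced, gamma=0.5)  # lift the dark regions
```

Balancing before brightening keeps the per-channel gains from being distorted by the non-linear gamma curve; a real pipeline would also have to control noise amplification, which this sketch ignores.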

Volume 251, Article 104271

Citations: 0

Learning to mask and permute visual tokens for Vision Transformer pre-training

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2025.104294

Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara

The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed k-CLIP, which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. We release an implementation of our code and models at https://github.com/aimagelab/MaPeT.
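The idea of "autoregressive and permuted predictions" can be made concrete with a small attention-mask sketch: patches are predicted in a random order, and each patch may only look at patches predicted before it. Whether MaPeT constructs its masks exactly this way is an assumption; the function name and the six-patch example are hypothetical.

```python
import numpy as np


def permutation_attention_mask(order: np.ndarray) -> np.ndarray:
    """Return mask[i, j] = True when token i may attend to token j.

    `order` lists token indices in the order they are predicted, so a token
    may only see tokens that come strictly earlier in that order.
    """
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))  # rank[i] = prediction step of token i
    return rank[None, :] < rank[:, None]


rng = np.random.default_rng(0)
num_patches = 6                           # illustrative sequence length
order = rng.permutation(num_patches)      # random factorisation order
mask = permutation_attention_mask(order)
```

The first patch in the order sees no context and the last one sees all the others, so across many sampled orders every patch is eventually predicted from every possible subset of the remaining patches.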

Volume 252, Article 104294

Citations: 0

Graph-based Moving Object Segmentation for underwater videos using semi-supervised learning

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2025.104290

Meghna Kapoor, Wieke Prummel, Jhony H. Giraldo, Badri Narayan Subudhi, Anastasia Zakharova, Thierry Bouwmans, Ankur Bansal

Moving object segmentation (MOS) using passive underwater image processing is an important technology for monitoring marine habitats. It aids marine biologists studying biological oceanography and the associated fields of chemical, physical, and geological oceanography to understand marine organisms. Dynamic backgrounds due to marine organisms like algae and seaweed, and improper illumination of the environment pose challenges in detecting moving objects in the scene. Previous graph-learning methods have shown promising results in MOS, but are mostly limited to terrestrial surface videos such as traffic video surveillance. Traditional object modeling fails in underwater scenes, due to fish shape and color degradation in motion and the lack of extensive underwater datasets for deep-learning models. Therefore, we propose a semi-supervised graph-learning approach (GraphMOS-U) to segment moving objects in underwater environments. Additionally, existing datasets were consolidated to form the proposed Teleost Fish Classification Dataset, specifically designed for fish classification tasks in complex environments to avoid unseen scenes, ensuring the replication of the transfer learning process on a ResNet-50 backbone. GraphMOS-U uses a six-step approach with transfer learning using Mask R-CNN and a ResNet-50 backbone for instance segmentation, followed by feature extraction using optical flow, visual saliency, and texture. After concatenating these features, a k-NN graph is constructed, and graph node classification is applied to label objects as foreground or background. The foreground nodes are used to reconstruct the segmentation map of the moving object from the scene. Quantitative and qualitative experiments demonstrate that GraphMOS-U outperforms state-of-the-art algorithms, accurately detecting moving objects while preserving fine details. The proposed method enables the use of graph-based MOS algorithms in underwater scenes.
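The last two stages described above, building a k-NN graph over concatenated features and classifying nodes with only a few labels, can be sketched as follows. This is not GraphMOS-U itself: random 3-D vectors stand in for the optical-flow, saliency, and texture features, simple neighbour averaging stands in for the graph node classifier, and all function names and parameter values are assumptions.

```python
import numpy as np


def knn_adjacency(features: np.ndarray, k: int) -> np.ndarray:
    """Build a symmetric k-nearest-neighbour adjacency matrix (no self-loops)."""
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # a node is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    adj = np.zeros_like(dists)
    rows = np.repeat(np.arange(len(features)), k)
    adj[rows, neighbours.ravel()] = 1.0
    return np.maximum(adj, adj.T)              # symmetrise


def propagate_labels(adj: np.ndarray, labels: np.ndarray, num_iters: int = 50) -> np.ndarray:
    """Semi-supervised node classification by iterative neighbour averaging.

    `labels` holds 1 (foreground), 0 (background) or -1 (unlabelled).
    """
    known = labels >= 0
    scores = np.where(known, labels, 0.5).astype(float)
    degree = adj.sum(axis=1)
    for _ in range(num_iters):
        scores = adj @ scores / degree
        scores[known] = labels[known]          # clamp the labelled nodes
    return (scores > 0.5).astype(int)


# Stand-in node features: two well-separated clusters of 10 nodes each.
rng = np.random.default_rng(0)
background = rng.normal(0.0, 0.1, size=(10, 3))
foreground = rng.normal(3.0, 0.1, size=(10, 3))
features = np.vstack([background, foreground])

labels = -np.ones(20, dtype=int)
labels[0], labels[10] = 0, 1                   # one labelled node per class

adj = knn_adjacency(features, k=5)
predicted = propagate_labels(adj, labels)
```

With k = 5 and ten nodes per cluster, every connected piece of the graph has at least six nodes, so each cluster stays connected and its single label reaches all of its nodes.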

Volume 252, Article 104290

Citations: 0

Illumination-aware and structure-guided transformer for low-light image enhancement

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104276

Guodong Fan, Zishu Yao, Min Gan

In this paper, we propose a novel illumination-aware and structure-guided transformer that achieves efficient image enhancement by focusing on brightness degradation and precise high-frequency guidance. Low-light images often contain numerous regions with similar brightness levels but different spatial locations. However, existing attention mechanisms only compute self-attention using channel dimensions or fixed-size spatial blocks, which limits their ability to capture long-range features and makes it challenging to achieve satisfactory image restoration quality. At the same time, the details of low-light images are mostly hidden in the darkness, yet existing models often give equal attention to both high-frequency and smooth regions, which makes it difficult to capture the details of deep degradation and results in blurry recovered image details. On the one hand, we introduce a dynamic brightness multi-domain self-attention mechanism that selectively focuses on spatial features within dynamic ranges and incorporates frequency domain information. This approach allows the model to capture both local details and global features, restoring global brightness while paying closer attention to regions with similar degradation. On the other hand, we propose a global maximum gradient search strategy to guide the model’s attention towards high-frequency detail regions, thereby achieving a more fine-grained restored image. Extensive experiments on various benchmark datasets demonstrate that our method achieves state-of-the-art performance.
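The abstract does not spell out the global maximum gradient search, so the following is only a loose illustration of gradient-based guidance towards high-frequency regions: it computes per-pixel gradient magnitude and scales it by the global maximum. The function name, the finite-difference gradient, and the test image are all assumptions, not the authors' strategy.

```python
import numpy as np


def gradient_guidance_map(gray: np.ndarray) -> np.ndarray:
    """Return per-pixel gradient magnitude scaled by its global maximum."""
    grad_rows, grad_cols = np.gradient(gray.astype(float))
    magnitude = np.hypot(grad_rows, grad_cols)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else magnitude


# A dark image with one vertical edge: left half 0.05, right half 0.20.
image = np.full((6, 6), 0.05)
image[:, 3:] = 0.20
guidance = gradient_guidance_map(image)
```

In this example the map is 1 in the two columns that straddle the edge and 0 in the flat regions, which is the kind of signal that could steer attention away from smooth areas.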

Volume 252, Article 104276

Citations: 0

Building extraction from remote sensing images with deep learning: A survey on vision techniques

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104253

Yuan Yuan, Xiaofeng Shi, Junyu Gao

Building extraction from remote sensing images is a hot topic in the fields of computer vision and remote sensing. In recent years, driven by deep learning, the accuracy of building extraction has improved significantly. This survey reviews recent deep learning-based building extraction methods, systematically covering concepts such as representation learning, efficient data utilization, multi-source fusion, and polygonal outputs, which previous surveys have rarely addressed comprehensively, thereby complementing existing research. Specifically, we first briefly introduce the relevant preliminaries and the challenges of building extraction with deep learning. Then we construct a systematic and instructive taxonomy from two perspectives: (1) a representation and learning-oriented perspective and (2) an input and output-oriented perspective. With this taxonomy, the recent building extraction methods are summarized. Furthermore, we introduce the key attributes of extensive publicly available benchmark datasets, the performance of some state-of-the-art models, and the freely available products. Finally, we outline future research directions from three aspects.

Volume 251, Article 104253

Citations: 0

From bias to balance: Leverage representation learning for bias-free MoCap solving

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104241

Georgios Albanis, Nikolaos Zioulis, Spyridon Thermos, Anargyros Chatzitofis, Kostas Kolomvatsos

Motion Capture (MoCap) is still dominated by optical MoCap as it remains the gold standard. However, the raw captured data even from such systems suffer from high-frequency noise and errors sourced from ghost or occluded markers. To that end, a post-processing step is often required to clean up the data, which is typically a tedious and time-consuming process. Some studies tried to address these issues in a data-driven manner, leveraging the availability of MoCap data. However, there is a high level of redundancy in such data, as the motion cycle usually comprises similar poses (e.g. standing still). Such redundancies affect the performance of those methods, especially on the rarer poses. In this work, we address the issue of long-tailed data distribution by leveraging representation learning. We introduce a novel technique for imbalanced regression that does not require additional data or labels. Our approach uses a Mahalanobis distance-based method for automatically identifying rare samples and properly reweighting them during training, while at the same time, we employ high-order interpolation algorithms to effectively sample the latent space of a Variational Autoencoder (VAE) to generate new tail samples. We show that the proposed approach can significantly improve the results, especially on the tail samples, while remaining model-agnostic and applicable across various architectures.
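Identifying rare samples by Mahalanobis distance and reweighting them can be sketched in a few lines. The abstract does not give the weighting rule, so the rule below (weight proportional to distance, normalised to mean 1) is hypothetical, as are the function names and the 2-D toy data standing in for pose representations.

```python
import numpy as np


def mahalanobis_distances(samples: np.ndarray) -> np.ndarray:
    """Distance of every sample from the data mean under the data covariance."""
    centred = samples - samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    cov_inv = np.linalg.pinv(cov)              # pseudo-inverse guards against a singular covariance
    squared = np.einsum("ij,jk,ik->i", centred, cov_inv, centred)
    return np.sqrt(np.maximum(squared, 0.0))


def rarity_weights(samples: np.ndarray) -> np.ndarray:
    """Hypothetical weighting rule: proportional to distance, normalised to mean 1."""
    dist = mahalanobis_distances(samples)
    return dist / dist.mean()


rng = np.random.default_rng(0)
common = rng.normal(0.0, 1.0, size=(200, 2))   # many similar "poses"
rare = np.array([[6.0, 6.0]])                  # one far-away "pose"
weights = rarity_weights(np.vstack([common, rare]))
```

Multiplying each sample's training loss by its weight would make the far-away sample count for more than the common ones, without requiring any extra labels.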

Volume 251, Article 104241

Citations: 0

UAV-based person re-identification: A survey of UAV datasets, approaches, and challenges

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2024.104261

Yousaf Albaluchi, Biying Fu, Naser Damer, Raghavendra Ramachandra, Kiran Raja

Person re-identification (ReID) has gained significant interest due to growing public safety concerns that require advanced surveillance and identification mechanisms. While most existing ReID research relies on static surveillance cameras, the use of Unmanned Aerial Vehicles (UAVs) for surveillance has recently gained popularity. Noting the promising application of UAVs in ReID, this paper presents a comprehensive overview of UAV-based ReID, highlighting publicly available datasets, key challenges, and methodologies. We summarize and consolidate evaluations conducted across multiple studies, providing a unified perspective on the state of UAV-based ReID research. We underscore the importance of current datasets, despite their limited size and diversity, in advancing UAV-based ReID research. The survey also lists all available approaches for UAV-based ReID and presents the associated challenges, including environmental conditions, image quality issues, and privacy concerns. We discuss dynamic adaptation techniques, multi-model fusion, and lightweight algorithms to leverage ground-based person ReID datasets for UAV applications. Finally, we explore potential research directions, highlighting the need for diverse datasets, lightweight algorithms, and innovative approaches to tackle the unique challenges of UAV-based person ReID.

Volume 251, Article 104261

Citations: 0

MASK_LOSS guided non-end-to-end image denoising network based on multi-attention module with bias rectified linear unit and absolute pooling unit

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2025.104302

Jing Zhang, Jingcheng Yu, Zhicheng Zhang, Congyao Zheng, Yao Le, Yunsong Li

Deep learning-based image denoising algorithms have demonstrated superior denoising performance but suffer from loss of details and excessive smoothing of edges after denoising. In addition, these denoising models often involve redundant calculations, resulting in low utilization rates and poor generalization capabilities. To address these challenges, we propose a Non-end-to-end Multi-Attention Denoising Network (N-ete MADN). First, we propose a Bias Rectified Linear Unit (BReLU) to replace ReLU as the activation function, which provides enhanced flexibility and an expanded activation range without additional computation, and we use it to construct a Feature Extraction Unit (FEU) with depth-wise convolutions (DConv). Then an Absolute Pooling Unit (AbsPooling-unit) is proposed and used to build a Channel Attention Block (CAB), a Spatial Attention Block (SAB), and a Channel Similarity Attention Block (CSAB), which are integrated into a Multi-Attention Module (MAM). CAB and SAB aim to enhance the model’s focus on key information in the channel and spatial dimensions, respectively, while CSAB aims to improve the model’s ability to detect similar features. The MAM is then used to construct a Multi-Attention Denoising Network (MADN). Finally, a mask loss function (MASK_LOSS) and a compound multi-stage denoising network based on this loss and MADN, the N-ete MADN, are proposed. They aim to handle images with rich edge information, providing enhanced protection for edges and facilitating the reconstruction of edge information after image denoising. This framework enhances learning capacity and efficiency, effectively addressing edge detail loss challenges in denoising tasks. Experimental results on several datasets, including synthetic ones, demonstrate that our model can achieve state-of-the-art denoising performance with low computational costs.

Computer Vision and Image Understanding, Volume 252, Article 104302

Citations: 0

Collaborative Neural Painting

IF 4.3 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2025.104298

Nicola Dall’Asen , Willi Menapace , Elia Peruzzo , Enver Sangineto , Yiming Wang , Elisa Ricci

The process of painting fosters creativity and rational planning. However, existing generative AI mostly focuses on producing visually pleasing artworks, without emphasizing the painting process. We introduce a novel task, Collaborative Neural Painting (CNP), to facilitate collaborative painting between users and agents. Given any number of user-input brushstrokes as context, or just the desired object class, CNP should produce a sequence of strokes that supports the completion of a coherent painting. Importantly, the process can be gradual and iterative, allowing users to make modifications at any phase until completion. Moreover, we propose to solve this task using a painting representation based on a sequence of parametrized strokes, which makes both editing and composition operations easy. These parametrized strokes are processed by a Transformer-based architecture with a novel attention mechanism that models the relationship between the input strokes and the strokes to be completed. We also propose a new masking scheme to reflect the interactive nature of CNP and adopt diffusion models as the basic learning process for their effectiveness and diversity in the generative field. Finally, to develop and validate methods on the novel task, we introduce a new dataset of painted objects and an evaluation protocol to benchmark CNP both quantitatively and qualitatively. We demonstrate the effectiveness of our approach and the potential of the CNP task as a promising avenue for future research. Project page and code: this https URL.
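To make the idea of a painting as a sequence of parametrized strokes with a context mask more concrete, here is a minimal sketch. The `Stroke` fields and the `mask_strokes` helper are hypothetical: the abstract does not list the actual stroke parameters or describe the masking scheme in detail, so the random selection below only stands in for "some strokes are given by the user, the rest are left to the model".

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Stroke:
    # Hypothetical parameterization; the real stroke parameters are not given here.
    x: float
    y: float
    width: float
    height: float
    angle: float
    color: tuple


def mask_strokes(strokes, n_context, seed=0):
    """Mark `n_context` randomly chosen strokes as user-provided context.

    Returns a list of booleans aligned with `strokes`: True means the stroke
    is visible context, False means it is masked and left for the model.
    """
    if not 0 <= n_context <= len(strokes):
        raise ValueError("n_context must be between 0 and len(strokes)")
    rng = random.Random(seed)
    context_ids = set(rng.sample(range(len(strokes)), n_context))
    return [i in context_ids for i in range(len(strokes))]


# A toy painting made of five strokes.
painting = [Stroke(0.1 * i, 0.2, 0.05, 0.02, 0.0, (255, 0, 0)) for i in range(5)]

visible = mask_strokes(painting, n_context=2)
context = [s for s, keep in zip(painting, visible) if keep]
targets = [s for s, keep in zip(painting, visible) if not keep]
```

Setting `n_context=0` corresponds to the case where only the object class is given and every stroke must be generated; re-running the mask with a larger `n_context` mimics a user adding strokes between iterations.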

Computer Vision and Image Understanding, Volume 252, Article 104298

Citations: 0

Comparing Human Pose Estimation through deep learning approaches: An overview

IF 4.3 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2025-02-01 DOI: 10.1016/j.cviu.2025.104297

Gaetano Dibenedetto , Stefanos Sotiropoulos , Marco Polignano , Giuseppe Cavallo , Pasquale Lops

In the everyday IoT ecosystem, many devices and systems are interconnected in an intelligent living environment to create a comfortable and efficient living space. In this scenario, approaches based on automatic recognition of actions and events can support fully autonomous digital assistants and personalized services. A pivotal component in this domain is “Human Pose Estimation”, which plays a critical role in action recognition for a wide range of applications, including home automation, healthcare, safety, and security. These systems are designed to detect human actions and deliver customized real-time responses and support. Selecting an appropriate technique for Human Pose Estimation is crucial to enhancing these systems for various applications. This choice hinges on the specific environment, and techniques can be categorized by whether they are designed for images or videos, single-person or multi-person scenarios, and monocular or multiview inputs. A comprehensive overview of recent research outcomes is essential to showcase the evolution of the research area, along with its underlying principles and varied application domains. Key benchmarks for these techniques provide valuable insights into their performance; hence, the paper summarizes these benchmarks and offers a comparative analysis of the techniques. As research in this field continues to evolve, it is critical for researchers to stay up to date with the latest developments and methodologies to promote further innovation in pose estimation research. This overview therefore presents a thorough examination of the subject. Its objective is to equip researchers with the knowledge and resources needed to investigate the topic and to retrieve the information relevant to their own work.

Computer Vision and Image Understanding, Volume 252, Article 104297

Citations: 0