
Latest Publications in Image and Vision Computing

Gait recognition via View-aware Part-wise Attention and Multi-scale Dilated Temporal Extractor
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-20 | DOI: 10.1016/j.imavis.2025.105464
Xu Song, Yang Wang, Yan Huang, Caifeng Shan
Gait recognition based on silhouette sequences has made significant strides in recent years through the extraction of body shape and motion features. However, challenges remain in achieving accurate gait recognition under covariate changes, such as variations in view and clothing. To tackle these issues, this paper introduces a novel methodology incorporating a View-aware Part-wise Attention (VPA) mechanism and a Multi-scale Dilated Temporal Extractor (MDTE) to enhance gait recognition. Distinct from existing techniques, the VPA mechanism acknowledges the differential sensitivity of various body parts to view changes, applying targeted attention weights at the feature level to improve the efficacy of view-aware constraints in areas of higher saliency or distinctiveness. Concurrently, MDTE employs dilated convolutions across multiple scales to capture the temporal dynamics of gait at diverse levels, thereby refining the motion representation. Comprehensive experiments on the CASIA-B, OU-MVLP, and Gait3D datasets validate the superior performance of our approach. Remarkably, our method achieves a 91.0% accuracy rate under clothing-change conditions on the CASIA-B dataset using solely silhouette information, surpassing the current state-of-the-art (SOTA) techniques. These results underscore the effectiveness and adaptability of our proposed strategy in overcoming the complexities of gait recognition amidst covariate changes.
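The listing provides only the abstract, so the MDTE implementation details are not given here; the snippet below is a minimal PyTorch sketch of a multi-scale dilated temporal block over frame-level gait features. The branch count, dilation rates, and channel width are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedTemporalExtractor(nn.Module):
    """Parallel 1D convolutions with different dilation rates over a
    (batch, channels, time) gait feature sequence; branch outputs are summed
    so each time step aggregates several temporal receptive fields."""

    def __init__(self, channels: int = 128, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm1d(channels),
                nn.LeakyReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) frame-level part features pooled over space
        return torch.stack([b(x) for b in self.branches], dim=0).sum(dim=0)

if __name__ == "__main__":
    feats = torch.randn(4, 128, 30)                    # 4 sequences, 30 frames
    out = MultiScaleDilatedTemporalExtractor()(feats)
    print(out.shape)                                   # torch.Size([4, 128, 30])
```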
{"title":"Gait recognition via View-aware Part-wise Attention and Multi-scale Dilated Temporal Extractor","authors":"Xu Song ,&nbsp;Yang Wang ,&nbsp;Yan Huang ,&nbsp;Caifeng Shan","doi":"10.1016/j.imavis.2025.105464","DOIUrl":"10.1016/j.imavis.2025.105464","url":null,"abstract":"<div><div>Gait recognition based on silhouette sequences has made significant strides in recent years through the extraction of body shape and motion features. However, challenges remain in achieving accurate gait recognition under covariate changes, such as variations in view and clothing. To tackle these issues, this paper introduces a novel methodology incorporating a View-aware Part-wise Attention (VPA) mechanism and a Multi-scale Dilated Temporal Extractor (MDTE) to enhance gait recognition. Distinct from existing techniques, VPA mechanism acknowledges the differential sensitivity of various body parts to view changes, applying targeted attention weights at the feature level to improve the efficacy of view-aware constraints in areas of higher saliency or distinctiveness. Concurrently, MDTE employs dilated convolutions across multiple scales to capture the temporal dynamics of gait at diverse levels, thereby refining the motion representation. Comprehensive experiments on the CASIA-B, OU-MVLP, and Gait3D datasets validate the superior performance of our approach. Remarkably, our method achieves a 91.0% accuracy rate under clothing-change conditions on the CASIA-B dataset using solely silhouette information, surpassing the current state-of-the-art (SOTA) techniques. These results underscore the effectiveness and adaptability of our proposed strategy in overcoming the complexities of gait recognition amidst covariate changes.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"156 ","pages":"Article 105464"},"PeriodicalIF":4.2,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143527562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FRoundation: Are foundation models ready for face recognition?
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-19 | DOI: 10.1016/j.imavis.2025.105453
Tahar Chettaoui, Naser Damer, Fadi Boutros
Foundation models are predominantly trained in an unsupervised or self-supervised manner on highly diverse and large-scale datasets, making them broadly applicable to various downstream tasks. In this work, we investigate for the first time whether such models are suitable for the specific domain of face recognition (FR). We further propose and demonstrate the adaptation of these models for FR across different levels of data availability, including synthetic data. Extensive experiments are conducted on multiple foundation models and datasets of varying scales for training and fine-tuning, with evaluation on a wide range of benchmarks. Our results indicate that, despite their versatility, pre-trained foundation models tend to underperform in FR in comparison with similar architectures trained specifically for this task. However, fine-tuning foundation models yields promising results, often surpassing models trained from scratch, particularly when training data is limited. For example, after fine-tuning only on 1K identities, DINOv2 ViT-S achieved an average verification accuracy of 87.10% on the LFW, CALFW, CPLFW, CFP-FP, and AgeDB30 benchmarks, compared to 64.70% achieved by the same model without fine-tuning; training the same architecture, ViT-S, from scratch on 1K identities reached 69.96%. With access to larger-scale FR training datasets, these performances reach 96.03% and 95.59% for the DINOv2 and CLIP ViT-L models, respectively. Compared to ViT-based architectures trained from scratch for FR, fine-tuning the same architectures from foundation-model weights achieves similar performance while requiring lower training computational cost and not relying on the assumption of extensive data availability. We further demonstrate the use of synthetic face data, showing improved performance over both pre-trained foundation models and ViT models. Additionally, we examine demographic biases, noting slightly higher biases in certain settings when using foundation models compared to models trained from scratch. We release our code and pre-trained model weights at github.com/TaharChettaoui/FRoundation.
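As a hedged illustration of the fine-tuning recipe the abstract describes (adapting a pre-trained ViT backbone to FR on a limited number of identities), here is a minimal PyTorch sketch using a CosFace-style margin head. The stand-in backbone, margin, and scale values are assumptions; the paper's actual loss and training setup may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmaxHead(nn.Module):
    """CosFace-style additive-margin softmax head commonly used when
    fine-tuning a backbone for face recognition; margin and scale are
    typical defaults, not necessarily the paper's choices."""

    def __init__(self, embed_dim: int, num_ids: int, s: float = 64.0, m: float = 0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_ids, embed_dim))
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # (B, num_ids)
        cos = cos - self.m * F.one_hot(labels, cos.size(1)).float()  # margin on target class
        return F.cross_entropy(self.s * cos, labels)

# Stand-in for a pre-trained ViT backbone (the paper fine-tunes DINOv2/CLIP ViTs).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 384))
head = MarginSoftmaxHead(embed_dim=384, num_ids=1000)

opt = torch.optim.AdamW(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
images = torch.randn(8, 3, 112, 112)
labels = torch.randint(0, 1000, (8,))

opt.zero_grad()
loss = head(backbone(images), labels)
loss.backward()
opt.step()
print(float(loss))
```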
{"title":"FRoundation: Are foundation models ready for face recognition?","authors":"Tahar Chettaoui ,&nbsp;Naser Damer ,&nbsp;Fadi Boutros","doi":"10.1016/j.imavis.2025.105453","DOIUrl":"10.1016/j.imavis.2025.105453","url":null,"abstract":"<div><div>Foundation models are predominantly trained in an unsupervised or self-supervised manner on highly diverse and large-scale datasets, making them broadly applicable to various downstream tasks. In this work, we investigate for the first time whether such models are suitable for the specific domain of face recognition (FR). We further propose and demonstrate the adaptation of these models for FR across different levels of data availability, including synthetic data. Extensive experiments are conducted on multiple foundation models and datasets of varying scales for training and fine-tuning, with evaluation on a wide range of benchmarks. Our results indicate that, despite their versatility, pre-trained foundation models tend to underperform in FR in comparison with similar architectures trained specifically for this task. However, fine-tuning foundation models yields promising results, often surpassing models trained from scratch, particularly when training data is limited. For example, after fine-tuning only on 1K identities, DINOv2 ViT-S achieved average verification accuracy on LFW, CALFW, CPLFW, CFP-FP, and AgeDB30 benchmarks of 87.10%, compared to 64.70% achieved by the same model and without fine-tuning. While training the same model architecture, ViT-S, from scratch on 1k identities reached 69.96%. With access to larger-scale FR training datasets, these performances reach 96.03% and 95.59% for the DINOv2 and CLIP ViT-L models, respectively. In comparison to the ViT-based architectures trained from scratch for FR, fine-tuned same architectures of foundation models achieve similar performance while requiring lower training computational costs and not relying on the assumption of extensive data availability. We further demonstrated the use of synthetic face data, showing improved performances over both pre-trained foundation and ViT models. Additionally, we examine demographic biases, noting slightly higher biases in certain settings when using foundation models compared to models trained from scratch. We release our code and pre-trained models’ weights at <span><span>github.com/TaharChettaoui/FRoundation</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"156 ","pages":"Article 105453"},"PeriodicalIF":4.2,"publicationDate":"2025-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143510399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Vehicle re-identification with large separable kernel attention and hybrid channel attention
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-17 | DOI: 10.1016/j.imavis.2025.105442
Xuezhi Xiang, Zhushan Ma, Xiaoheng Li, Lei Zhang, Xiantong Zhen
With the rapid development of intelligent transportation systems and the popularity of smart city infrastructure, vehicle Re-ID technology has become an important research field. The vehicle Re-ID task faces a key challenge: the high similarity between different vehicles. Existing methods use additional detection or segmentation models to extract differentiated local features; however, they either rely on additional annotations or greatly increase the computational cost. Using attention mechanisms to capture global and local features is crucial for addressing the high inter-class similarity in vehicle Re-ID tasks. In this paper, we propose LSKA-ReID with large separable kernel attention and hybrid channel attention. Specifically, the large separable kernel attention (LSKA) combines the advantages of self-attention with those of convolution, extracting the global and local features of the vehicle more comprehensively. We also compare the performance of LSKA and large kernel attention (LKA) on the vehicle Re-ID task. In addition, we introduce hybrid channel attention (HCA), which combines channel attention with spatial information, so that the model can better focus on informative channels and feature regions while ignoring background and other distracting information. Extensive experiments on three popular datasets, VeRi-776, VehicleID, and VERI-Wild, demonstrate the effectiveness of LSKA-ReID. In particular, on the VeRi-776 dataset, mAP reaches 86.78% and Rank-1 reaches 98.09%.
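The abstract does not spell out the LSKA block, so the following is a minimal PyTorch sketch of the general large-separable-kernel-attention pattern: cascaded 1D depthwise convolutions (a local pair plus a dilated pair) followed by a 1x1 convolution whose output gates the input features. Kernel size and dilation are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn as nn

class LargeSeparableKernelAttention(nn.Module):
    """Approximates a large 2D depthwise kernel with cascaded 1D depthwise
    convolutions (horizontal then vertical, plus a dilated pair) and a 1x1
    convolution; the result modulates the input elementwise."""

    def __init__(self, dim: int, k: int = 23, d: int = 3):
        super().__init__()
        kd = 2 * d - 1                  # local (non-dilated) 1D kernel size
        ks = k // d                     # dilated 1D kernel size
        self.local_h = nn.Conv2d(dim, dim, (1, kd), padding=(0, kd // 2), groups=dim)
        self.local_v = nn.Conv2d(dim, dim, (kd, 1), padding=(kd // 2, 0), groups=dim)
        self.dilated_h = nn.Conv2d(dim, dim, (1, ks), padding=(0, (ks // 2) * d),
                                   dilation=d, groups=dim)
        self.dilated_v = nn.Conv2d(dim, dim, (ks, 1), padding=((ks // 2) * d, 0),
                                   dilation=d, groups=dim)
        self.point = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.local_v(self.local_h(x))
        attn = self.dilated_v(self.dilated_h(attn))
        attn = self.point(attn)
        return x * attn                 # attention map gates the input features

if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)   # vehicle feature map
    print(LargeSeparableKernelAttention(64)(feat).shape)   # (2, 64, 32, 32)
```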
{"title":"Vehicle re-identification with large separable kernel attention and hybrid channel attention","authors":"Xuezhi Xiang ,&nbsp;Zhushan Ma ,&nbsp;Xiaoheng Li ,&nbsp;Lei Zhang ,&nbsp;Xiantong Zhen","doi":"10.1016/j.imavis.2025.105442","DOIUrl":"10.1016/j.imavis.2025.105442","url":null,"abstract":"<div><div>With the rapid development of intelligent transportation systems and the popularity of smart city infrastructure, Vehicle Re-ID technology has become an important research field. The vehicle Re-ID task faces an important challenge, which is the high similarity between different vehicles. Existing methods use additional detection or segmentation models to extract differentiated local features. However, these methods either rely on additional annotations or greatly increase the computational cost. Using attention mechanism to capture global and local features is crucial to solve the challenge of high similarity between classes in vehicle Re-ID tasks. In this paper, we propose LSKA-ReID with large separable kernel attention and hybrid channel attention. Specifically, the large separable kernel attention (LSKA) utilizes the advantages of self-attention and also benefits from the advantages of convolution, which can extract the global and local features of the vehicle more comprehensively. We also compare the performance of LSKA and large kernel attention (LKA) on the vehicle ReID task. We also introduce hybrid channel attention (HCA), which combines channel attention with spatial information, so that the model can better focus on channels and feature regions, and ignore background and other disturbing information. Extensive experiments on three popular datasets VeRi-776, VehicleID and VERI-Wild demonstrate the effectiveness of LSKA-ReID. In particular, on VeRi-776 dataset, mAP reaches 86.78% and Rank-1 reaches 98.09%.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105442"},"PeriodicalIF":4.2,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Innovative underwater image enhancement algorithm: Combined application of adaptive white balance color compensation and pyramid image fusion to submarine algal microscopy
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-16 | DOI: 10.1016/j.imavis.2025.105466
Yi-Ning Fan, Geng-Kun Wu, Jia-Zheng Han, Bei-Ping Zhang, Jie Xu
Real-time collected microscopic images of harmful algal blooms (HABs) in coastal areas often suffer from significant color deviations and loss of fine cellular details. To address these issues, this paper proposes an innovative method for enhancing underwater marine algal microscopic images based on Adaptive White Balance Color Compensation (AWBCC) and Image Pyramid Fusion (IPF). First, an effective Adaptive Cyclic Channel Compensation (ACCC) algorithm is proposed, based on the gray-world assumption, to enhance the color of underwater images. Then, the Maximum Color Channel Attention Guidance (MCCAG) method is employed to reduce color disturbance caused by ignoring light absorption. This paper introduces an Empirical Contrast Enhancement (ECH) module based on multi-scale IPF tailored for underwater microscopic images of algae, which is used for global contrast enhancement, texture detail enhancement, and noise control. Second, this paper proposes a network based on a diffusion probability model for edge detection in HABs, which simultaneously considers both high-order and low-order features extracted from images. This approach enriches the semantic information of the feature maps and enhances edge detection accuracy. The edge detection method achieves an ODS of 0.623 and an OIS of 0.683. Experimental evaluations demonstrate that our underwater algae microscopic image enhancement method amplifies local texture features while preserving the original image structure, significantly improving the accuracy of edge detection and key point matching. Compared to several state-of-the-art underwater image enhancement methods, our approach achieves the highest values in contrast, average gradient, entropy, and Enhancement Measure Estimation (EME), and also delivers competitive results in terms of image noise control.
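As a rough, generic illustration of the gray-world-style channel compensation that ACCC builds on (not the paper's exact algorithm), here is a small NumPy sketch that compensates weaker channels toward the strongest one and then equalizes the channel means.

```python
import numpy as np

def gray_world_channel_compensation(img: np.ndarray) -> np.ndarray:
    """Gray-world style white balance for an RGB float image in [0, 1]:
    weaker channels are compensated toward the strongest one in proportion
    to the mean-intensity gap, then each channel is rescaled so the three
    channel means match. A generic baseline, not the paper's ACCC module."""
    img = img.astype(np.float64)
    means = img.reshape(-1, 3).mean(axis=0)              # per-channel means
    ref = means.max()                                    # strongest channel mean
    strongest = img[..., int(means.argmax())]            # strongest channel image
    out = np.empty_like(img)
    for c in range(3):
        # add compensation proportional to the gap to the strongest channel
        out[..., c] = img[..., c] + (ref - means[c]) * strongest
    # classic gray-world gain: scale each channel to the global mean
    gains = out.reshape(-1, 3).mean(axis=0)
    out *= out.reshape(-1, 3).mean() / np.maximum(gains, 1e-6)
    return np.clip(out, 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    greenish = rng.random((64, 64, 3)) * np.array([0.4, 0.9, 0.6])  # color-cast image
    balanced = gray_world_channel_compensation(greenish)
    print(balanced.reshape(-1, 3).mean(axis=0))          # channel means roughly equalized
```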
{"title":"Innovative underwater image enhancement algorithm: Combined application of adaptive white balance color compensation and pyramid image fusion to submarine algal microscopy","authors":"Yi-Ning Fan ,&nbsp;Geng-Kun Wu ,&nbsp;Jia-Zheng Han ,&nbsp;Bei-Ping Zhang ,&nbsp;Jie Xu","doi":"10.1016/j.imavis.2025.105466","DOIUrl":"10.1016/j.imavis.2025.105466","url":null,"abstract":"<div><div>Real-time collected microscopic images of harmful algal blooms (HABs) in coastal areas often suffer from significant color deviations and loss of fine cellular details. To address these issues, this paper proposes an innovative method for enhancing underwater marine algal microscopic images based on Adaptive White Balance Color Compensation (AWBCC) and Image Pyramid Fusion (IPF). Firstly, an effective Algorithm Adaptive Cyclic Channel Compensation (ACCC) is proposed based on the gray world assumption to enhance the color of underwater images. Then, the Maximum Color Channel Attention Guidance (MCCAG) method is employed to reduce color disturbance caused by ignoring light absorption. This paper introduces an Empirical Contrast Enhancement (ECH) module based on multi-scale IPF tailored for underwater microscopic images of algae, which is used for global contrast enhancement, texture detail enhancement, and noise control. Secondly, this paper proposes a network based on a diffusion probability model for edge detection in HABs, which simultaneously considers both high-order and low-order features extracted from images. This approach enriches the semantic information of the feature maps and enhances edge detection accuracy. This edge detection method achieves an ODS of 0.623 and an OIS of 0.683. Experimental evaluations demonstrate that our underwater algae microscopic image enhancement method amplifies local texture features while preserving the original image structure. This significantly improves the accuracy of edge detection and key point matching. Compared to several state-of-the-art underwater image enhancement methods, our approach achieves the highest values in contrast, average gradient, entropy, and Enhancement Measure Estimation (EME), and also delivers competitive results in terms of image noise control.<!--> <!-->.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"156 ","pages":"Article 105466"},"PeriodicalIF":4.2,"publicationDate":"2025-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143471218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Two-modal multiscale feature cross fusion for hyperspectral unmixing
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-15 | DOI: 10.1016/j.imavis.2025.105445
Senlong Qin, Yuqi Hao, Minghui Chu, Xiaodong Yu
Hyperspectral images (HSI) possess rich spectral characteristics but suffer from low spatial resolution, which has led many methods to focus on extracting more spatial information from HSI. However, the spatial information that can be extracted from a single HSI is limited, making it difficult to distinguish objects with similar materials. To address this issue, we propose a multimodal unmixing network called MSFF-Net. This network enhances unmixing performance by integrating the spatial information from light detection and ranging (LiDAR) data into the unmixing process. To ensure a more comprehensive fusion of features from the two modalities, we introduce a multi-scale cross-fusion method, providing a new approach to multimodal data fusion. Additionally, the network employs attention mechanisms to enhance channel-wise and spatial features, boosting the model's representational capacity. Our proposed model effectively consolidates multimodal information, significantly improving its unmixing capability, especially in complex environments, leading to more accurate unmixing results and facilitating further analysis of HSI. We evaluate our method using two real-world datasets. Experimental results demonstrate that our proposed approach outperforms other state-of-the-art methods in terms of both stability and effectiveness.
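The abstract describes cross fusion between HSI and LiDAR features without giving the exact formulation; below is a minimal PyTorch sketch of one common realization, bidirectional cross-attention between the two modalities' token sets. The dimensions and head count are illustrative, not the MSFF-Net design.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Bidirectional cross-attention between HSI and LiDAR feature tokens:
    each modality queries the other, and the attended features are added
    back to the originals."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.hsi_from_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_from_hsi = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hsi: torch.Tensor, lidar: torch.Tensor):
        # hsi, lidar: (B, N, C) tokens flattened from spatial feature maps
        hsi_out, _ = self.hsi_from_lidar(query=hsi, key=lidar, value=lidar)
        lidar_out, _ = self.lidar_from_hsi(query=lidar, key=hsi, value=hsi)
        return hsi + hsi_out, lidar + lidar_out

if __name__ == "__main__":
    hsi_tokens = torch.randn(2, 256, 64)      # e.g. 16x16 spatial positions
    lidar_tokens = torch.randn(2, 256, 64)
    fused_hsi, fused_lidar = CrossModalFusion()(hsi_tokens, lidar_tokens)
    print(fused_hsi.shape, fused_lidar.shape)
```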
{"title":"Two-modal multiscale feature cross fusion for hyperspectral unmixing","authors":"Senlong Qin,&nbsp;Yuqi Hao,&nbsp;Minghui Chu,&nbsp;Xiaodong Yu","doi":"10.1016/j.imavis.2025.105445","DOIUrl":"10.1016/j.imavis.2025.105445","url":null,"abstract":"<div><div>Hyperspectral images (HSI) possess rich spectral characteristics but suffer from low spatial resolution, which has led many methods to focus on extracting more spatial information from HSI. However, the spatial information that can be extracted from a single HSI is limited, making it difficult to distinguish objects with similar materials. To address this issue, we propose a multimodal unmixing network called MSFF-Net. This network enhances unmixing performance by integrating the spatial information from light detection and ranging (LiDAR) data into the unmixing process. To ensure a more comprehensive fusion of features from the two modalities, we introduce a multi-scale cross-fusion method, providing a new approach to multimodal data fusion. Additionally, the network employs attention mechanisms to enhance channel-wise and spatial features, boosting the model's representational capacity. Our proposed model effectively consolidates multimodal information, significantly improving its unmixing capability, especially in complex environments, leading to more accurate unmixing results and facilitating further analysis of HSI. We evaluate our method using two real-world datasets. Experimental results demonstrate that our proposed approach outperforms other state-of-the-art methods in terms of both stability and effectiveness.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105445"},"PeriodicalIF":4.2,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143445920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Proactive robot task sequencing through real-time hand motion prediction in human–robot collaboration
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-15 | DOI: 10.1016/j.imavis.2025.105443
Shyngyskhan Abilkassov, Michael Gentner, Almas Shintemirov, Eckehard Steinbach, Mirela Popa
Human–robot collaboration (HRC) is essential for improving productivity and safety across various industries. While reactive motion re-planning strategies are useful, there is a growing demand for proactive methods that predict human intentions to enable more efficient collaboration. This study addresses this need by introducing a framework that combines deep learning-based human hand trajectory forecasting with heuristic optimization for robotic task sequencing. The deep learning model advances real-time hand position forecasting using a multi-task learning loss to account for both hand positions and contact delay regression, achieving state-of-the-art performance on the Ego4D Future Hand Prediction benchmark. By integrating hand trajectory predictions into task planning, the framework offers a cohesive solution for HRC. To optimize task sequencing, the framework incorporates a Dynamic Variable Neighborhood Search (DynamicVNS) heuristic algorithm, which allows robots to pre-plan task sequences and avoid potential collisions with human hand positions. DynamicVNS provides significant computational advantages over the generalized VNS method. The framework was validated on a UR10e robot performing a visual inspection task in a HRC scenario, where the robot effectively anticipated and responded to human hand movements in a shared workspace. Experimental results highlight the system’s effectiveness and potential to enhance HRC in industrial settings by combining predictive accuracy and task planning efficiency.
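As a hedged sketch of the multi-task objective mentioned in the abstract (future hand positions plus contact-delay regression), here is a small PyTorch loss module; the loss type and weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class HandForecastLoss(nn.Module):
    """Multi-task objective combining future hand-position regression with a
    contact-delay regression term; the smooth-L1 choice and the weighting are
    illustrative assumptions."""

    def __init__(self, delay_weight: float = 0.5):
        super().__init__()
        self.pos_loss = nn.SmoothL1Loss()
        self.delay_loss = nn.SmoothL1Loss()
        self.delay_weight = delay_weight

    def forward(self, pred_traj, gt_traj, pred_delay, gt_delay):
        # pred_traj / gt_traj: (B, T, 2) future 2D hand positions
        # pred_delay / gt_delay: (B,) time until hand-object contact
        return self.pos_loss(pred_traj, gt_traj) + \
               self.delay_weight * self.delay_loss(pred_delay, gt_delay)

if __name__ == "__main__":
    loss_fn = HandForecastLoss()
    loss = loss_fn(torch.randn(4, 15, 2), torch.randn(4, 15, 2),
                   torch.rand(4), torch.rand(4))
    print(float(loss))
```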
{"title":"Proactive robot task sequencing through real-time hand motion prediction in human–robot collaboration","authors":"Shyngyskhan Abilkassov ,&nbsp;Michael Gentner ,&nbsp;Almas Shintemirov ,&nbsp;Eckehard Steinbach ,&nbsp;Mirela Popa","doi":"10.1016/j.imavis.2025.105443","DOIUrl":"10.1016/j.imavis.2025.105443","url":null,"abstract":"<div><div>Human–robot collaboration (HRC) is essential for improving productivity and safety across various industries. While reactive motion re-planning strategies are useful, there is a growing demand for proactive methods that predict human intentions to enable more efficient collaboration. This study addresses this need by introducing a framework that combines deep learning-based human hand trajectory forecasting with heuristic optimization for robotic task sequencing. The deep learning model advances real-time hand position forecasting using a multi-task learning loss to account for both hand positions and contact delay regression, achieving state-of-the-art performance on the Ego4D Future Hand Prediction benchmark. By integrating hand trajectory predictions into task planning, the framework offers a cohesive solution for HRC. To optimize task sequencing, the framework incorporates a Dynamic Variable Neighborhood Search (DynamicVNS) heuristic algorithm, which allows robots to pre-plan task sequences and avoid potential collisions with human hand positions. DynamicVNS provides significant computational advantages over the generalized VNS method. The framework was validated on a UR10e robot performing a visual inspection task in a HRC scenario, where the robot effectively anticipated and responded to human hand movements in a shared workspace. Experimental results highlight the system’s effectiveness and potential to enhance HRC in industrial settings by combining predictive accuracy and task planning efficiency.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105443"},"PeriodicalIF":4.2,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143429776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CMASR: Lightweight image super-resolution with cluster and match attention
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-14 | DOI: 10.1016/j.imavis.2025.105457
Detian Huang, Mingxin Lin, Hang Liu, Huanqiang Zeng
The Transformer has recently achieved impressive success in image super-resolution due to its ability to model long-range dependencies with multi-head self-attention (MHSA). However, most existing MHSAs focus only on the dependencies among individual tokens and ignore those among token clusters containing several tokens, leaving the Transformer unable to adequately explore global features. On the other hand, the Transformer neglects local features, which inevitably hinders accurate detail reconstruction. To address these issues, we propose a lightweight image super-resolution method with cluster and match attention (CMASR). Specifically, a token Clustering block is designed to divide input tokens into token clusters of different sizes with depthwise separable convolution. Subsequently, we propose an efficient axial matching self-attention (AMSA) mechanism, which introduces an axial matrix to extract local features, including axial similarities and symmetries. Further, by combining AMSA and Window Self-Attention, we construct a Hybrid Self-Attention block that captures the dependencies among token clusters of different sizes to sufficiently extract axial local features and global features. Extensive experiments demonstrate that the proposed CMASR outperforms state-of-the-art methods at lower computational cost (i.e., fewer parameters and FLOPs).
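The token Clustering block is described only at a high level; the snippet below is a simplified PyTorch sketch of grouping tokens into coarser clusters with a strided depthwise separable convolution. The stride, kernel size, and channel width are illustrative, not the exact CMASR block.

```python
import torch
import torch.nn as nn

class TokenClusteringBlock(nn.Module):
    """Groups image tokens into coarser token clusters with a strided depthwise
    convolution followed by a pointwise convolution; the stride controls the
    cluster size."""

    def __init__(self, dim: int = 60, cluster_stride: int = 2):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel_size=3, stride=cluster_stride,
                                   padding=1, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, C) with N = h * w
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.pointwise(self.depthwise(x))            # (B, C, h/s, w/s)
        return x.flatten(2).transpose(1, 2)              # coarser cluster tokens

if __name__ == "__main__":
    tokens = torch.randn(2, 64 * 64, 60)
    clusters = TokenClusteringBlock()(tokens, 64, 64)
    print(clusters.shape)                                # torch.Size([2, 1024, 60])
```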
{"title":"CMASR: Lightweight image super-resolution with cluster and match attention","authors":"Detian Huang ,&nbsp;Mingxin Lin ,&nbsp;Hang Liu ,&nbsp;Huanqiang Zeng","doi":"10.1016/j.imavis.2025.105457","DOIUrl":"10.1016/j.imavis.2025.105457","url":null,"abstract":"<div><div>The Transformer has recently achieved impressive success in image super-resolution due to its ability to model long-range dependencies with multi-head self-attention (MHSA). However, most existing MHSAs focus only on the dependencies among individual tokens, and ignore the ones among token clusters containing several tokens, resulting in the inability of Transformer to adequately explore global features. On the other hand, Transformer neglects local features, which inevitably hinders accurate detail reconstruction. To address the above issues, we propose a lightweight image super-resolution method with cluster and match attention (CMASR). Specifically, a token Clustering block is designed to divide input tokens into token clusters of different sizes with depthwise separable convolution. Subsequently, we propose an efficient axial matching self-attention (AMSA) mechanism, which introduces an axial matrix to extract local features, including axial similarities and symmetries. Further, by combining AMSA and Window Self-Attention, we construct a Hybrid Self-Attention block to capture the dependencies among token clusters of different sizes to sufficiently extract axial local features and global features. Extensive experiments demonstrate that the proposed CMASR outperforms state-of-the-art methods with fewer computational cost (i.e., the number of parameters and FLOPs).</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105457"},"PeriodicalIF":4.2,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143445918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FGS-NeRF: A fast glossy surface reconstruction method based on voxel and reflection directions
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-14 | DOI: 10.1016/j.imavis.2025.105455
Han Hong, Qing Ye, Keyun Xiong, Qing Tao, Yiqian Wan
Neural surface reconstruction technology has great potential for recovering 3D surfaces from multiview images. However, surface gloss can severely affect the reconstruction quality. Although existing methods address the issue of glossy surface reconstruction, achieving rapid reconstruction remains a challenge. While DVGO can achieve rapid scene geometry search, it tends to create numerous holes in glossy surfaces during the search process. To address this, we design a geometry search method based on SDF and reflection directions, employing a method called progressive voxel-MLP scaling to achieve accurate and efficient geometry searches for glossy scenes. To mitigate object edge artifacts caused by reflection directions, we use a simple loss function called sigmoid RGB loss, which helps reduce artifacts around objects during the early stages of training and promotes efficient surface convergence. In this work, we introduce the FGS-NeRF model, which uses a coarse-to-fine training method combined with reflection directions to achieve rapid reconstruction of glossy object surfaces based on voxel grids. The training time on a single RTX 4080 GPU is 20 min. Evaluations on the Shiny Blender and Smart Car datasets confirm that our model significantly improves the speed when compared with existing glossy object reconstruction methods while achieving accurate object surfaces. Code: https://github.com/yosugahhh/FGS-nerf.
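Conditioning appearance on reflection directions typically uses the standard reflection of the view direction about the surface normal (obtained, for example, from normalized SDF gradients); the short PyTorch sketch below computes r = d - 2(d.n)n. This is the standard formula, not code taken from the paper.

```python
import torch
import torch.nn.functional as F

def reflection_directions(view_dirs: torch.Tensor, normals: torch.Tensor) -> torch.Tensor:
    """Reflect view directions about surface normals, r = d - 2 (d . n) n,
    the usual parameterization when conditioning radiance on the reflected
    direction for glossy surfaces."""
    n = F.normalize(normals, dim=-1)
    d = F.normalize(view_dirs, dim=-1)
    return d - 2.0 * (d * n).sum(dim=-1, keepdim=True) * n

if __name__ == "__main__":
    dirs = torch.randn(1024, 3)       # per-sample view directions
    normals = torch.randn(1024, 3)    # e.g. SDF gradients before normalization
    refl = reflection_directions(dirs, normals)
    print(refl.shape)                 # torch.Size([1024, 3])
```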
{"title":"FGS-NeRF: A fast glossy surface reconstruction method based on voxel and reflection directions","authors":"Han Hong ,&nbsp;Qing Ye ,&nbsp;Keyun Xiong ,&nbsp;Qing Tao ,&nbsp;Yiqian Wan","doi":"10.1016/j.imavis.2025.105455","DOIUrl":"10.1016/j.imavis.2025.105455","url":null,"abstract":"<div><div>Neural surface reconstruction technology has great potential for recovering 3D surfaces from multiview images. However, surface gloss can severely affect the reconstruction quality. Although existing methods address the issue of glossy surface reconstruction, achieving rapid reconstruction remains a challenge. While DVGO can achieve rapid scene geometry search, it tends to create numerous holes in glossy surfaces during the search process. To address this, we design a geometry search method based on SDF and reflection directions, employing a method called progressive voxel-MLP scaling to achieve accurate and efficient geometry searches for glossy scenes. To mitigate object edge artifacts caused by reflection directions, we use a simple loss function called sigmoid RGB loss, which helps reduce artifacts around objects during the early stages of training and promotes efficient surface convergence. In this work, we introduce the FGS-NeRF model, which uses a coarse-to-fine training method combined with reflection directions to achieve rapid reconstruction of glossy object surfaces based on voxel grids. The training time on a single RTX 4080 GPU is 20 min. Evaluations on the Shiny Blender and Smart Car datasets confirm that our model significantly improves the speed when compared with existing glossy object reconstruction methods while achieving accurate object surfaces. Code: <span><span>https://github.com/yosugahhh/FGS-nerf</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105455"},"PeriodicalIF":4.2,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143445919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ESDA: Zero-shot semantic segmentation based on an embedding semantic space distribution adjustment strategy
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-13 | DOI: 10.1016/j.imavis.2025.105456
Jiaguang Li, Ying Wei, Wei Zhang, Chuyuan Wang
Recently, the CLIP model, which is pre-trained on large-scale vision-language data, has promoted the development of zero-shot recognition tasks. Some researchers apply CLIP to zero-shot semantic segmentation, but they often struggle to achieve satisfactory results, because this dense prediction task requires not only a precise understanding of semantics but also a precise perception of different regions within one image. However, CLIP is trained on image-level vision-language data, resulting in ineffective perception of pixel-level regions. In this paper, we propose a new zero-shot semantic segmentation (ZS3) method based on an embedding semantic space distribution adjustment strategy (ESDA), which enables CLIP to accurately perceive both semantics and regions. This method inserts additional trainable blocks into the CLIP image encoder, enabling it to effectively perceive regions without losing semantic understanding. We also design spatial distribution losses to guide the update of the trainable blocks' parameters, further enhancing the regional characteristics of pixel-level image embeddings. In addition, previous methods obtain semantic support only through a text [CLS] token, which is far from sufficient for this dense prediction task. Therefore, we design a vision-language embedding interactor, which obtains richer semantic support through the interaction between the entire text embedding and the image embedding, and which further strengthens the image embedding. Extensive experiments on PASCAL-5i and COCO-20i demonstrate the effectiveness of our method, which achieves a new state of the art for zero-shot semantic segmentation and exceeds many few-shot semantic segmentation methods. Codes are available at https://github.com/Jiaguang-NEU/ESDA.
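The abstract mentions inserting trainable blocks into a frozen CLIP image encoder; a common way to realize this is a residual bottleneck adapter, sketched below in PyTorch with a frozen stand-in transformer layer. The adapter design and sizes are assumptions, not necessarily the ESDA blocks.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Small trainable bottleneck added after a frozen encoder layer: only the
    adapter parameters receive gradients, one common way to give a frozen
    CLIP image encoder extra region-aware capacity."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)            # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return tokens + self.up(self.act(self.down(tokens)))

# Frozen stand-in for one CLIP transformer layer, followed by a trainable adapter.
frozen_layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
for p in frozen_layer.parameters():
    p.requires_grad_(False)
adapter = ResidualAdapter()

tokens = torch.randn(2, 197, 768)                 # [CLS] + 14x14 patch tokens
out = adapter(frozen_layer(tokens))
print(out.shape, sum(p.numel() for p in adapter.parameters() if p.requires_grad))
```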
{"title":"ESDA: Zero-shot semantic segmentation based on an embedding semantic space distribution adjustment strategy","authors":"Jiaguang Li,&nbsp;Ying Wei,&nbsp;Wei Zhang,&nbsp;Chuyuan Wang","doi":"10.1016/j.imavis.2025.105456","DOIUrl":"10.1016/j.imavis.2025.105456","url":null,"abstract":"<div><div>Recently, the CLIP model, which is pre-trained on large-scale vision-language data, has promoted the development of zero-shot recognition tasks. Some researchers apply CLIP to zero-shot semantic segmentation, but they often struggle to achieve satisfactory results. This is because this dense prediction task requires not only a precise understanding of semantics, but also a precise perception of different regions within one image. However, CLIP is trained on image-level vision-language data, resulting in ineffective perception of pixel-level regions. In this paper, we propose a new zero-shot semantic segmentation (ZS3) method based on an embedding semantic space distribution adjustment strategy (ESDA), which enables CLIP to accurately perceive both semantics and regions. This method inserts additional trainable blocks into the CLIP image encoder, enabling it to effectively perceive regions without losing semantic understanding. Besides, we design spatial distribution losses to guide the update of parameters of the trainable blocks, thereby further enhancing the regional characteristics of pixel-level image embeddings. In addition, previous methods only obtain semantic support through a text [CLS] token, which is far from sufficient for the dense prediction task. Therefore, we design a vision-language embedding interactor, which can obtain richer semantic support through the interaction between the entire text embedding and image embedding. It can also further enhance the semantic support and strengthen the image embedding. Plenty of experiments on PASCAL-<span><math><msup><mrow><mn>5</mn></mrow><mrow><mi>i</mi></mrow></msup></math></span> and COCO-<span><math><mrow><mn>2</mn><msup><mrow><mn>0</mn></mrow><mrow><mi>i</mi></mrow></msup></mrow></math></span> prove the effectiveness of our method. Our method achieves new state-of-the-art for zero-shot semantic segmentation and exceeds many few-shot semantic segmentation methods. Codes are available at <span><span>https://github.com/Jiaguang-NEU/ESDA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105456"},"PeriodicalIF":4.2,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143421918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semantic consistency learning for unsupervised multi-modal person re-identification
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-13 | DOI: 10.1016/j.imavis.2025.105434
Yuxin Zhang, Zhu Teng, Baopeng Zhang
Unsupervised multi-modal person re-identification poses significant challenges due to the substantial modality gap and the absence of annotations. Although previous efforts have aimed to bridge this gap by establishing modality correspondences, their focus has been confined to the feature and image level correspondences, neglecting full utilization of semantic information. To tackle these issues, we propose a Semantic Consistency Learning Network (SCLNet) for unsupervised multi-modal person re-identification. SCLNet first predicts pseudo-labels using a hierarchical clustering algorithm, which capitalizes on common semantics to perform mutual refinement across modalities and establishes cross-modality label correspondences based on semantic analysis. Besides, we also design a cross-modality loss that utilizes contrastive learning to acquire modality-invariant features, effectively reducing the inter-modality gap and enhancing the robustness of the model. Furthermore, we construct a new multi-modality dataset named Subway-TM. This dataset not only encompasses visible and infrared modalities but also includes a depth modality, captured by three cameras across 266 identities, comprising 10,645 RGB images, 10,529 infrared images, and 10,529 depth images. To the best of our knowledge, this is the first person re-identification dataset with three modalities. We conduct extensive experiments, utilizing the widely employed person re-identification datasets SYSU-MM01 and RegDB, along with our newly proposed multi-modal Subway-TM dataset. The experimental results show that our proposed method is promising compared to the current state-of-the-art methods.
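As a generic illustration of a cross-modality contrastive objective of the kind the abstract describes (pulling RGB and infrared embeddings of the same pseudo-labeled identity together), here is a small symmetric InfoNCE sketch in PyTorch; the exact loss used by SCLNet may differ.

```python
import torch
import torch.nn.functional as F

def cross_modality_infonce(rgb_emb: torch.Tensor, ir_emb: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE between L2-normalized RGB and infrared embeddings,
    where row i of each tensor is assumed to share the same pseudo-label;
    a generic contrastive objective for aligning the two modalities."""
    rgb = F.normalize(rgb_emb, dim=-1)
    ir = F.normalize(ir_emb, dim=-1)
    logits = rgb @ ir.t() / temperature            # (B, B) cross-modal similarities
    targets = torch.arange(rgb.size(0), device=rgb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    rgb = torch.randn(16, 256)
    ir = torch.randn(16, 256)
    print(float(cross_modality_infonce(rgb, ir)))
```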
{"title":"Semantic consistency learning for unsupervised multi-modal person re-identification","authors":"Yuxin Zhang,&nbsp;Zhu Teng,&nbsp;Baopeng Zhang","doi":"10.1016/j.imavis.2025.105434","DOIUrl":"10.1016/j.imavis.2025.105434","url":null,"abstract":"<div><div>Unsupervised multi-modal person re-identification poses significant challenges due to the substantial modality gap and the absence of annotations. Although previous efforts have aimed to bridge this gap by establishing modality correspondences, their focus has been confined to the feature and image level correspondences, neglecting full utilization of semantic information. To tackle these issues, we propose a Semantic Consistency Learning Network (SCLNet) for unsupervised multi-modal person re-identification. SCLNet first predicts pseudo-labels using a hierarchical clustering algorithm, which capitalizes on common semantics to perform mutual refinement across modalities and establishes cross-modality label correspondences based on semantic analysis. Besides, we also design a cross-modality loss that utilizes contrastive learning to acquire modality-invariant features, effectively reducing the inter-modality gap and enhancing the robustness of the model. Furthermore, we construct a new multi-modality dataset named Subway-TM. This dataset not only encompasses visible and infrared modalities but also includes a depth modality, captured by three cameras across 266 identities, comprising 10,645 RGB images, 10,529 infrared images, and 10,529 depth images. To the best of our knowledge, this is the first person re-identification dataset with three modalities. We conduct extensive experiments, utilizing the widely employed person re-identification datasets SYSU-MM01 and RegDB, along with our newly proposed multi-modal Subway-TM dataset. The experimental results show that our proposed method is promising compared to the current state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105434"},"PeriodicalIF":4.2,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143445921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0