W-MambaFuse: A wavelet decomposition and adaptive state-space modeling approach for anatomical and functional image fusion
Pub Date: 2026-01-01 | Epub Date: 2025-11-08 | DOI: 10.1016/j.imavis.2025.105796
Bowen Zhong , Shijie Li , Xuan Deng , Zheng Li
Anatomical-functional image fusion plays a critical role in a variety of medical and biological applications. Current convolutional neural network-based fusion algorithms are constrained by their limited receptive fields, impeding the effective modeling of long-range dependencies in medical images. While transformer-based architectures possess global modeling capabilities, they face computational challenges due to the quadratic complexity of their self-attention mechanisms. To address these limitations, we propose a network based on wavelet-domain decomposition and an adaptive selective structured state space model, termed W-MambaFuse, for anatomical and functional image fusion. Specifically, the network first applies a wavelet transform to enlarge the receptive field of the convolutional layers, facilitating the capture of low-frequency structural outlines and high-frequency textural primitives. Furthermore, we develop an adaptive gated fusion module, referred to as CNN-Mamba Gated (MCG), which leverages the dynamic modeling capability of state space models and the local feature extraction strengths of convolutional neural networks. This design facilitates the effective extraction of both intra-modal and inter-modal features, thereby enhancing multimodal image fusion. Experimental results on benchmark datasets demonstrate that W-MambaFuse consistently outperforms pure CNN-based models, transformer-based models, and CNN-transformer hybrid approaches in terms of both visual quality and quantitative evaluations. Our code is publicly available at https://github.com/Bowen-Zhong/W-Mamba.
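The abstract names two ingredients: a wavelet step that trades spatial resolution for a larger effective receptive field, and a gated fusion of a convolutional branch with a state-space (Mamba-style) branch. The sketch below is a rough, hedged illustration in PyTorch; the Haar decomposition and the gating form are assumptions for illustration, not the paper's actual MCG design.

```python
import torch
import torch.nn as nn

def haar_dwt2d(x):
    """Single-level 2D Haar decomposition of a feature map x: (B, C, H, W).
    Returns the low-frequency band (LL) and three high-frequency bands (LH, HL, HH),
    each at half resolution, which is how one wavelet step enlarges the effective
    receptive field of the convolutions that follow."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

class GatedBranchFusion(nn.Module):
    """Hypothetical gated fusion of a convolutional (local) branch and a
    state-space (global) branch; a per-pixel gate decides how much of each
    branch to keep. The paper's MCG module is not reproduced here."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, local_feat, global_feat):
        g = self.gate(torch.cat([local_feat, global_feat], dim=1))
        return g * local_feat + (1 - g) * global_feat
```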
{"title":"W-MambaFuse: A wavelet decomposition and adaptive state-space modeling approach for anatomical and functional image fusion","authors":"Bowen Zhong , Shijie Li , Xuan Deng , Zheng Li","doi":"10.1016/j.imavis.2025.105796","DOIUrl":"10.1016/j.imavis.2025.105796","url":null,"abstract":"<div><div>Anatomical-functional image fusion plays a critical role in a variety of medical and biological applications. Current convolutional neural network-based fusion algorithms are constrained by their limited receptive fields, impeding the effective modeling of long-range dependencies in medical images. While transformer-based architectures possess global modeling capabilities, they face computational challenges due to the quadratic complexity of their self-attention mechanisms. To address these limitations, we propose a network based on wavelet-domain decomposition and an adaptive selectively structured state space model, termed as W-MambaFuse, for anatomical and functional image fusion. Specifically, the network first applies a wavelet transform to enlarge the receptive field of the convolutional layers, facilitating the capture of low-frequency structural outlines and high-frequency textural primitives. Furthermore, we develop an adaptive gated fusion module, referred to as CNN-Mamba Gated (MCG), which leverages the dynamic modeling capability of state space models and the local feature extraction strengths of convolutional neural networks. This design facilitates the effective extraction of both intra-modal and inter-modal features, thereby enhancing multimodal image fusion. Experimental results on benchmark datasets demonstrate that W-MambaFuse consistently outperforms pure CNN-based models, transformer-based models, and CNN-transformer hybrid approaches in terms of both visual quality and quantitative evaluations. Our code is publicly available at <span><span>https://github.com/Bowen-Zhong/W-Mamba</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105796"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145521232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A human layout consistency framework for image-based virtual try-on
Pub Date: 2026-01-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105831
Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts the actual human layout parsed from the try-on result. The supervisory signals, free from ground-truth image pairs, are constructed by assessing the consistency between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training the try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal generates high-quality try-on results, outperforming baselines by margins of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
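One plausible form of the layout-consistency supervision is a divergence between the expected layout predicted by HLG and the layout parsed from the try-on result by HLP. The sketch below is a hypothetical illustration only; the paper's exact loss is not specified in the abstract.

```python
import torch.nn.functional as F

def layout_consistency_loss(expected_logits, actual_logits):
    """Penalize disagreement between the expected human layout (from HLG) and
    the layout parsed from the try-on result (by HLP).
    Both tensors: (B, K, H, W) per-pixel scores over K layout classes."""
    expected = F.softmax(expected_logits, dim=1).detach()   # treat the HLG output as a soft target
    log_actual = F.log_softmax(actual_logits, dim=1)
    return F.kl_div(log_actual, expected, reduction="batchmean")
```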
{"title":"A human layout consistency framework for image-based virtual try-on","authors":"Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong","doi":"10.1016/j.imavis.2025.105831","DOIUrl":"10.1016/j.imavis.2025.105831","url":null,"abstract":"<div><div>Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest due to its elimination of the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, free from the ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with the cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105831"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals
Pub Date: 2026-01-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi
Objective
To develop a robust model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.
Methods
The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient-light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. The study followed a multidisciplinary approach combining the Internet of Things (IoT) and Artificial Intelligence (AI) and was conducted over six months (April 2024 to September 2024) in Saudi Arabia, using resources from Najran University. Data collection involved deploying IoT devices across diverse indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, under varying lighting, weather, and dynamic conditions to ensure real-world applicability. The resulting dataset was used to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection, following a rigorous training and validation process to ensure reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.
Results
The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.
Conclusion
The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.
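The Methods above rely on recursive Bayesian filtering to fuse the proximity, ambient-light, and motion streams into a context estimate. A minimal discrete-state illustration of one such update follows; the context classes and sensor likelihood values are purely hypothetical, not taken from the study.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """One recursive Bayesian filtering step over discrete context states."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Three hypothetical contexts: indoor, outdoor, busy street.
belief = np.array([1 / 3, 1 / 3, 1 / 3])      # uniform prior over contexts
sensor_likelihoods = [
    np.array([0.7, 0.2, 0.1]),   # proximity sensor: P(reading | context), illustrative values
    np.array([0.2, 0.5, 0.3]),   # ambient-light sensor, illustrative values
    np.array([0.1, 0.3, 0.6]),   # motion sensor, illustrative values
]
for lik in sensor_likelihoods:   # fuse sensors by chaining updates (assumes conditional independence)
    belief = bayes_update(belief, lik)
print(belief)                    # posterior over contexts after fusing all three sensors
```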
{"title":"Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals","authors":"Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi","doi":"10.1016/j.imavis.2025.105827","DOIUrl":"10.1016/j.imavis.2025.105827","url":null,"abstract":"<div><h3>Objective</h3><div>To develop a robust proposed model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.</div></div><div><h3>Methods</h3><div>The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop a proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.</div></div><div><h3>Results</h3><div>The Proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.</div></div><div><h3>Conclusion</h3><div>The Proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. 
The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105827"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LET-CViT: A low-light enhanced two-stream CNN and vision transformer for Deepfake detection
Pub Date: 2026-01-01 | Epub Date: 2025-11-20 | DOI: 10.1016/j.imavis.2025.105828
Gaoming Yang , Yifan Song , Xiangyu Yang , Ji Zhang
With the development of generative technologies, fake faces have become increasingly realistic. Unknown forgery methods and complex generation environments make Deepfake detection challenging. While existing detectors can identify most forged images under normal lighting conditions, their performance deteriorates in different lighting environments, especially under low-light conditions. In this paper, to address the challenge of forged face detection in low-light environments, we present a novel Low-light Enhanced Two-stream CNN and Vision Transformer (LET-CViT) framework, which contains our improved ReLU-CBAM Depthwise Separable Convolution (RC-DSC) block and Dynamic Sigmoid-Gated Multi-Head Attention (DSG-MHA) block. In addition, LET-CViT incorporates two innovative modules, namely Low-light Enhancement with Denoising (LED) and Wavelet Transform high-frequency Fusion (WTF). Specifically, the LED module improves low-light image quality and captures fake textures through light enhancement and directional denoising. The WTF module then captures multi-scale features and focuses on high-frequency information by repeatedly fusing the high-frequency sub-bands obtained after discrete wavelet transformation, while reducing interference from low-frequency information. Extensive experiments on several datasets show that our framework reliably detects forged videos under low-light conditions. The AUCs for the unseen DeeperForensics-1.0 and DFD datasets reach 95.73% and 95.24%, respectively, significantly outperforming other mainstream models. The code for reproducing our results is publicly available at https://github.com/SYF-code/LET-CViT.
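Of the blocks named above, the dynamic sigmoid gating admits a compact illustration: a standard multi-head self-attention whose update is rescaled by an input-dependent sigmoid gate. This is a hedged sketch of the gating idea only, not the paper's DSG-MHA block.

```python
import torch
import torch.nn as nn

class SigmoidGatedMHA(nn.Module):
    """Multi-head self-attention whose output is modulated by a learned,
    input-dependent sigmoid gate (illustrative sketch only)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):                      # x: (B, N, dim) token sequence
        attn_out, _ = self.attn(x, x, x)
        return x + self.gate(x) * attn_out     # gate dynamically scales the attention update
```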
{"title":"LET-CViT: A low-light enhanced two-stream CNN and vision transformer for Deepfake detection","authors":"Gaoming Yang , Yifan Song , Xiangyu Yang , Ji Zhang","doi":"10.1016/j.imavis.2025.105828","DOIUrl":"10.1016/j.imavis.2025.105828","url":null,"abstract":"<div><div>With the development of generative technologies, fake faces have become increasingly realistic. Unknown forgery methods and complex generation environments make Deepfake detection challenging. While existing detectors can identify most forged images under normal lighting conditions, their performance deteriorates in different lighting environments, especially under low-light conditions. In this paper, to address the challenges of forged face detection performance in low-light environments, we present a novel Low-light Enhanced Two-stream CNN and Vision Transformer (LET-CViT) framework, which contains our improved ReLU-CBAM Depthwise Separable Convolution (RC-DSC) block and Dynamic Sigmoid-Gated Multi-Head Attention (DSG-MHA) block. At the same time, the LET-CViT incorporates two innovative modules, namely Low-light Enhancement with Denoising (LED) and Wavelet Transform high-frequency Fusion (WTF). Specifically, the premier LED module is capable of improving low-light image quality and capturing fake textures with light enhancement technology and directional denoising. Subsequently, the proposed WTF module captures multi-scale features and focuses on high-frequency information by multiple fusions of high-frequency sub-bands after discrete wavelet transformation, while reducing the interference of low-frequency information. Extensive experiments on several datasets show that our framework is able to reliably detect forged videos under low-light conditions. The AUCs for the unseen DeeperForensics-1.0 and DFD datasets reach 95.73% and 95.24% respectively, significantly outperforming other mainstream models. The code for reproducing our results is publicly available here: <span><span>https://github.com/SYF-code/LET-CViT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105828"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145624272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LSBE-Net: Semantic segmentation of large-scale point cloud scenes via local boundary feature and spatial attention aggregation
Pub Date: 2026-01-01 | Epub Date: 2025-11-03 | DOI: 10.1016/j.imavis.2025.105798
Hailang Wang, Keke Duan, Mingzi Zhang, Li Ma
3D point cloud semantic segmentation plays a pivotal role in comprehending 3D scenes and facilitating environmental perception. Existing studies predominantly emphasize the extraction of local geometric structures, but they often overlook the incorporation of local boundary cues and long-range spatial relationships. This limitation hampers precise delineation of object boundaries and impairs the distinction of long-distance instances. To address these challenges, we propose LSBE-Net, a novel segmentation algorithm designed to extract local boundary features and integrate spatial context features. The Local Surface Representation (LSR) module is introduced to capture local geometric shapes by encoding both surface and positional features, thereby providing critical structural information. The Local Boundary Enhancement (LBE) module extracts boundary features and fuses them with geometric and semantic features through a transformer mechanism within local neighborhoods, enabling the learning of contextual relationships and refinement of boundary delineation. These features are aggregated through the Spatial Encoding Attention (SEA) module, which facilitates the learning of long-range dependencies and spatial relationships across the point cloud. The proposed LSBE-Net is extensively evaluated on three large-scale benchmark datasets: S3DIS, Toronto3D, and Semantic3D. Our method achieves competitive mean Intersection over Union (mIoU) scores of 66.1%, 82.3%, and 78.0%, respectively, demonstrating its effectiveness and robustness in diverse real-world scenarios.
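For reference, the mIoU figures quoted above are computed per class and averaged. A minimal NumPy version of the metric is shown below; it is a generic implementation, not the authors' evaluation code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union for point-wise semantic labels.
    pred, gt: (N,) integer class labels for N points."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))
```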
{"title":"LSBE-Net: Semantic segmentation of large-scale point cloud scenes via local boundary feature and spatial attention aggregation","authors":"Hailang Wang, Keke Duan, Mingzi Zhang, Li Ma","doi":"10.1016/j.imavis.2025.105798","DOIUrl":"10.1016/j.imavis.2025.105798","url":null,"abstract":"<div><div>3D point cloud semantic segmentation plays a pivotal role in comprehending 3D scenes and facilitating environmental perception. Existing studies predominantly emphasize the extraction of local geometric structures, but they often overlook the incorporation of local boundary cues and long-range spatial relationships. This limitation hampers precise delineation of object boundaries and impairs the distinction of long distance instances. To address these challenges, we propose LSBE-Net, a novel segmentation algorithm designed to extract local boundary features and integrate spatial context features. The Local Surface Representation (LSR) module is introduced to capture local geometric shapes by encoding both surface and positional features, thereby providing critical structural information. The Local Boundary Enhancement (LBE) module extracts boundary features and fuses them with geometric and semantic features through a transformer mechanism within local neighborhoods, enabling the learning of contextual relationships and refinement of boundary delineation. These features are aggregated through the Spatial Encoding Attention (SEA) module, which facilitates the learning of long-range dependencies and spatial relationship across the point cloud. The proposed LSBE-Net is extensively evaluated on three large-scale benchmark datasets: S3DIS, Toronto3D, and Semantic3D. Our method achieves competitive mean Intersection over Union (mIoU) scores of 66.1%, 82.3%, and 78.0%, respectively, demonstrating its effectiveness and robustness in diverse real-world scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105798"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145468939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition
Pub Date: 2026-01-01 | Epub Date: 2025-11-10 | DOI: 10.1016/j.imavis.2025.105823
Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen
Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF fuses the features extracted from TransCNN, further improving feature discriminability through spatial and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.
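The exact form of the CIBCE loss is not given in the abstract. One common way to counter attribute imbalance, shown here only as a hypothetical sketch, is to up-weight rare positive labels inside a multi-label binary cross-entropy.

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(logits, targets, pos_freq):
    """Multi-label BCE with per-attribute positive weighting (illustrative only,
    not the paper's CIBCE loss).
    logits, targets: (B, A) attribute scores and 0/1 labels.
    pos_freq: (A,) fraction of positive samples per attribute in the training set."""
    pos_weight = (1.0 - pos_freq) / pos_freq.clamp(min=1e-6)   # rarer positives get larger weight
    return F.binary_cross_entropy_with_logits(logits, targets.float(), pos_weight=pos_weight)
```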
{"title":"CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition","authors":"Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen","doi":"10.1016/j.imavis.2025.105823","DOIUrl":"10.1016/j.imavis.2025.105823","url":null,"abstract":"<div><div>Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF is presented to fuse the features extracted from TransCNN to further improve the feature’s discriminability by using spatial attention and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve the model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105823"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic-assisted unpaired image dehazing
Pub Date: 2026-01-01 | Epub Date: 2025-11-06 | DOI: 10.1016/j.imavis.2025.105818
Yang Yang, Lei Zhang, Ke Pang, Tongtong Chen, Xiaodong Yue
Recently, a series of innovative unpaired image dehazing techniques have been introduced. While they relieve the pressure of collecting paired data, these methods typically overlook the integration of semantic information, which is essential for a more comprehensive dehazing process. Our research aims to bridge this gap by proposing a novel method that fully integrates semantic information into unpaired image dehazing. Specifically, we propose a semantic information-guided feature enhancement and fusion block, which selectively fuses the refined features guided by the semantic result layer and semantic feature layer based on the uncertainty of semantic information. In addition, our method adopts semantic information to guide the generation of haze during training. This results in a more diverse set of hazy images, which in turn enhances dehazing performance. Furthermore, in terms of the loss function, we introduce a loss term that constrains the semantic information entropy of the dehazing results. This constraint ensures that the dehazed images not only achieve clarity but also retain semantic accuracy and integrity. Extensive experiments validate our superiority over other methods and the effectiveness of our designs. The code is available at .
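The entropy constraint mentioned above can be written, in one hedged form, as the mean per-pixel entropy of semantic predictions on the dehazed output; the paper's exact term may differ from this sketch.

```python
import torch
import torch.nn.functional as F

def semantic_entropy_loss(seg_logits):
    """Mean per-pixel entropy of semantic predictions on the dehazed image.
    seg_logits: (B, K, H, W); minimizing this encourages confident, semantically
    consistent predictions (illustrative form only)."""
    p = F.softmax(seg_logits, dim=1)
    entropy = -(p * torch.log(p.clamp(min=1e-8))).sum(dim=1)   # (B, H, W)
    return entropy.mean()
```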
{"title":"Semantic-assisted unpaired image dehazing","authors":"Yang Yang, Lei Zhang, Ke Pang, Tongtong Chen, Xiaodong Yue","doi":"10.1016/j.imavis.2025.105818","DOIUrl":"10.1016/j.imavis.2025.105818","url":null,"abstract":"<div><div>Recently, a series of innovative unpaired image dehazing techniques have been introduced, they have relieved pressure from collecting paired data, yet these methods typically overlook the integration of semantic information, which is essential for a more comprehensive dehazing process. Our research aims to bridge this gap by proposing a novel method that fully integrates feature information into unpaired image dehazing. Specifically, we propose a semantic information-guided feature enhancement and fusion block, which selectively fuses the refined features guided by the semantic result layer and semantic feature layer based on the uncertainty of semantic information. Besides, our method adopts semantic information to guide the generation of haze in the training process. This approach results in the creation of a more diverse set of hazy images, which in turn enhances the dehazing performance. Furthermore, in terms of the loss function, we introduce a loss term that constrains the semantic information entropy of the dehazing results. This constraint ensures that the dehazed images not only achieve clarity but also retain semantic accuracy and integrity. Extensive experiments validate our superiority over other methods and the effectiveness of our designs. The code is available at .</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105818"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145528664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single stage weakly supervised semantic segmentation via enhanced patch affinity
Pub Date: 2025-12-01 | Epub Date: 2025-10-15 | DOI: 10.1016/j.imavis.2025.105791
Jingjie Jiang , Yuhui Zheng , Guoqing Zhang
Weakly supervised semantic segmentation (WSSS) with image-level labels typically employs class activation maps (CAMs) to generate pseudo-labels. Existing WSSS methods, whether based on CNN or Transformer frameworks, predominantly adopt multi-stage pipelines that entail stage-wise training and disparate strategies, resulting in complex inter-stage interactions. Furthermore, prior approaches frequently optimize CAMs directly via patch affinity in the Vision Transformer (ViT), a computationally intensive process that may lead to excessive background activation and blurred object boundaries. To address these limitations, we propose a single-stage WSSS method called SSEPA (Single Stage WSSS with Enhanced Patch Affinity), which integrates end-to-end optimization of the initial CAMs via patch affinity. To further enhance patch affinity in attention maps, we propose the Adaptive Layer Attention Fusion (ALAF) module. ALAF assesses the importance of attention from layers at different depths by assigning weights and fusing them through dynamic weight vectors. Experiments on the PASCAL VOC and MS COCO datasets show that our method significantly improves the quality of CAMs and segmentation models. Compared to previous single-stage methods, SSEPA exhibits lower misclassification probability and produces more precise object boundaries, fully verifying the effectiveness of our approach.
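A rough sketch of the layer-fusion idea is given below: attention maps from several ViT depths are combined with input-dependent weights derived from per-layer statistics. The real ALAF architecture is not specified in the abstract, so this is an illustrative assumption.

```python
import torch
import torch.nn as nn

class AdaptiveLayerAttentionFusion(nn.Module):
    """Fuse patch-affinity (attention) maps from several ViT depths with
    input-dependent weights (hypothetical sketch, not the paper's module)."""
    def __init__(self, num_layers):
        super().__init__()
        self.score = nn.Linear(num_layers, num_layers)

    def forward(self, attn_maps):                 # attn_maps: (B, L, N, N), L = number of layers
        stats = attn_maps.mean(dim=(2, 3))        # per-layer summary statistic, (B, L)
        weights = torch.softmax(self.score(stats), dim=-1)           # dynamic weight vector
        return (weights[:, :, None, None] * attn_maps).sum(dim=1)    # fused affinity, (B, N, N)
```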
{"title":"Single stage weakly supervised semantic segmentation via enhanced patch affinity","authors":"Jingjie Jiang , Yuhui Zheng , Guoqing Zhang","doi":"10.1016/j.imavis.2025.105791","DOIUrl":"10.1016/j.imavis.2025.105791","url":null,"abstract":"<div><div>Weakly supervised semantic segmentation (WSSS) with image-level labels typically employs class activation maps (CAMs) to generate pseudo-labels. Existing WSSS methods, whether based on CNN or Transformer frameworks, predominantly adopt multi-stage pipelines that entail stage-wise training and disparate strategies, resulting in complex inter-stage interactions. Furthermore, prior approaches frequently optimize CAMs directly via patch affinity in Vision Transformer (ViT), a computationally intensive process and may lead to excessive background activation and blurred object boundaries. To address these limitations, we propose a single-stage WSSS method called SSEPA (Single Stage WSSS with Enhanced Patch Affinity), which integrates end-to-end optimization of initial CAMs by patch affinity. To further enhance patch affinity in attention maps, we propose the Adaptive Layer Attention Fusion (ALAF) module. ALAF assesses the importance of attention from different depth layers by assigning weights and fusing them through dynamic weight vectors. Experiments on the PASCAL VOC and MS COCO datasets show that our method can significantly improve the quality of CAM and segmentation models. Compared to previous single-stage methods, SSEPA exhibits lower misclassification probability and produces more precise object boundaries, fully verifying the effectiveness of our approach.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105791"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145419296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous acquisition of geometry and material for translucent objects
Pub Date: 2025-12-01 | Epub Date: 2025-10-24 | DOI: 10.1016/j.imavis.2025.105793
Chenhao Li , Trung Thanh Ngo , Hajime Nagahara
Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Consequently, previous works often assume that objects are opaque or use a simplified model to describe translucent objects, which significantly affects reconstruction quality and limits downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate the complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.
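In spirit, the hybrid supervision combines reconstruction errors from both renderers. The sketch below is purely illustrative: the renderer callables, the weights, and the choice of L1 error are assumptions, and the paper's augmented loss term is not reproduced here.

```python
import torch.nn.functional as F

def hybrid_rendering_loss(pred_params, target_img, physical_render, neural_render,
                          w_phys=1.0, w_neural=1.0):
    """Combine supervision from a differentiable physical renderer and a learned
    neural renderer (illustrative only). `physical_render` and `neural_render`
    are placeholder callables mapping predicted geometry/material parameters to
    an image with the same shape as `target_img`."""
    loss_phys = F.l1_loss(physical_render(pred_params), target_img)
    loss_neural = F.l1_loss(neural_render(pred_params), target_img)
    return w_phys * loss_phys + w_neural * loss_neural
```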
{"title":"Simultaneous acquisition of geometry and material for translucent objects","authors":"Chenhao Li , Trung Thanh Ngo , Hajime Nagahara","doi":"10.1016/j.imavis.2025.105793","DOIUrl":"10.1016/j.imavis.2025.105793","url":null,"abstract":"<div><div>Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Therefore, previous works often make the assumption that the objects are opaque or use a simplified model to describe translucent objects, which significantly affects the reconstruction quality and limits the downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117 K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105793"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145366055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPM-CyViT: A self-supervised pre-trained cycle-consistent vision transformer with multi-branch for contrast-enhanced CT synthesis
Pub Date: 2025-12-01 | Epub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105802
Hongwei Yang , Wen Zeng , Ke Chen , Zhan Hua , Yan Zhuang , Lin Han , Guoliang Liao , Yiteng Zhang , Hanyu Li , Zhenlin Li , Jiangli Lin
Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, with real CECT as the benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.21/5.00 across multiple perspectives. Additionally, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 of real CECT, collectively underscoring its clinical application potential.
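For context, the contrast-to-noise ratio (CNR) quoted above is commonly computed from a signal region and a background region. One standard definition is sketched below; the paper's exact ROI protocol is not stated here, so treat this as a generic illustration.

```python
import numpy as np

def contrast_to_noise_ratio(roi_signal, roi_background):
    """CNR = |mean(signal ROI) - mean(background ROI)| / std(background ROI).
    roi_signal, roi_background: arrays of pixel intensities (illustrative definition)."""
    return abs(roi_signal.mean() - roi_background.mean()) / (roi_background.std() + 1e-8)
```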
{"title":"SPM-CyViT: A self-supervised pre-trained cycle-consistent vision transformer with multi-branch for contrast-enhanced CT synthesis","authors":"Hongwei Yang , Wen Zeng , Ke Chen , Zhan Hua , Yan Zhuang , Lin Han , Guoliang Liao , Yiteng Zhang , Hanyu Li , Zhenlin Li , Jiangli Lin","doi":"10.1016/j.imavis.2025.105802","DOIUrl":"10.1016/j.imavis.2025.105802","url":null,"abstract":"<div><div>Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, based on real CECT as a benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.215.00 across multiple perspectives. Additionally, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 value of real CECT, collectively underscoring its clinical application potential.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105802"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145467697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}