High-precision real-time autonomous driving target detection based on YOLOv8
Pub Date : 2024-09-19 DOI: 10.1007/s11554-024-01553-2
Huixin Liu, Guohua Lu, Mingxi Li, Weihua Su, Ziyi Liu, Xu Dang, Dongyuan Zang
In traffic scenarios, target sizes vary significantly and computing power is limited, which makes accurate traffic target detection challenging. This paper proposes a new traffic target detection method that balances accuracy and real-time performance: Deep and Filtered You Only Look Once (DF-YOLO). To address the large differences in target scale within complex scenes, we design the Deep and Filtered Path Aggregation Network (DF-PAN), a module that effectively fuses multi-scale features and enhances the model's capability to detect multi-scale targets accurately. To address limited computational resources, we design a parameter-sharing detection head (PSD) and use Faster Neural Network (FasterNet) as the backbone network. PSD reduces computational load through parameter sharing, allowing feature extraction capability to be shared across positions, while FasterNet improves memory access efficiency and thereby maximizes computational resource utilization. Experimental results on the KITTI dataset show that our method achieves a satisfactory balance between real-time performance and precision, reaching 90.9% mean average precision (mAP) at 77 frames/s while reducing the number of parameters by 28.1% and increasing detection accuracy by 3% compared to the baseline model. Tests on the challenging BDD100K and SODA10M datasets show that DF-YOLO has excellent generalization ability.
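As a rough illustration of the parameter-sharing idea, the sketch below reuses one detection head across feature maps of several strides, so its weights (and hence its feature-extraction capability) are shared across positions and scales. This is a minimal sketch of the general technique, not the paper's PSD: the module names, channel width, and class count are assumptions.

```python
import torch
import torch.nn as nn

class SharedDetectionHead(nn.Module):
    """One head whose weights are reused on every pyramid level."""
    def __init__(self, channels: int = 256, num_classes: int = 3, num_anchors: int = 1):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
        )
        self.cls_pred = nn.Conv2d(channels, num_anchors * num_classes, 1)
        self.box_pred = nn.Conv2d(channels, num_anchors * 4, 1)

    def forward(self, features):
        # The same convolution weights process every scale, so the parameter
        # count stays constant no matter how many levels are added.
        outputs = []
        for f in features:
            x = self.stem(f)
            outputs.append((self.cls_pred(x), self.box_pred(x)))
        return outputs

# Example: three FPN levels with identical channel width share one head.
head = SharedDetectionHead()
feats = [torch.randn(1, 256, s, s) for s in (80, 40, 20)]
preds = head(feats)  # list of (class logits, box regressions) per level
```

Because the same weights serve every pyramid level, adding a level costs no extra parameters, which is the usual source of the parameter savings such heads aim for.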
{"title":"High-precision real-time autonomous driving target detection based on YOLOv8","authors":"Huixin Liu, Guohua Lu, Mingxi Li, Weihua Su, Ziyi Liu, Xu Dang, Dongyuan Zang","doi":"10.1007/s11554-024-01553-2","DOIUrl":"https://doi.org/10.1007/s11554-024-01553-2","url":null,"abstract":"<p>In traffic scenarios, the size of targets varies significantly, and there is a limitation on computing power. This poses a significant challenge for algorithms to detect traffic targets accurately. This paper proposes a new traffic target detection method that balances accuracy and real-time performance—Deep and Filtered You Only Look Once (DF-YOLO). In response to the challenges posed by significant differences in target scales within complex scenes, we designed the Deep and Filtered Path Aggregation Network (DF-PAN). This module effectively fuses multi-scale features, enhancing the model's capability to detect multi-scale targets accurately. In response to the challenge posed by limited computational resources, we design a parameter-sharing detection head (PSD) and use Faster Neural Network (FasterNet) as the backbone network. PSD reduces computational load by parameter sharing and allows for feature extraction capability sharing across different positions. FasterNet enhances memory access efficiency, thereby maximizing computational resource utilization. The experimental results on the KITTI dataset show that our method achieves satisfactory balances between real-time and precision and reaches 90.9% mean average precision(mAP) with 77 frames/s, and the number of parameters is reduced by 28.1% and the detection accuracy is increased by 3% compared to the baseline model. We test it on the challenging BDD100K dataset and the SODA10M dataset, and the results show that DF-YOLO has excellent generalization ability.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"15 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GMS-YOLO: an enhanced algorithm for water meter reading recognition in complex environments
Pub Date : 2024-09-13 DOI: 10.1007/s11554-024-01551-4
Yu Wang, Xiaodong Xiang
The disordered arrangement of water-meter pipes and the random rotation angles of their mechanical character wheels frequently result in captured water-meter images exhibiting tilt, blur, and incomplete characters. These issues complicate the detection of water-meter images, rendering traditional OCR (optical character recognition) methods inadequate for current detection requirements. Furthermore, the two-stage detection method, which first locates and then recognizes, proves overly cumbersome. In this paper, water-meter reading recognition is approached as an object-detection task: readings are extracted from the algorithm's predicted-box information, a water-meter dataset is established, and the algorithmic framework is refined to improve the accuracy of recognizing incomplete characters. Using YOLOv8n as the baseline, we propose GMS-YOLO, a novel object-detection algorithm that employs Grouped Multi-Scale Convolution for enhanced performance. First, by substituting the Bottleneck module's convolution with GMSC (Grouped Multi-Scale Convolution), the model gains access to receptive fields at several scales, boosting its feature-extraction capability. Second, incorporating LSKA (Large Kernel Separable Attention) into the SPPF (Spatial Pyramid Pooling Fast) module improves the perception of fine-grained features. Finally, replacing CIoU (Complete Intersection over Union) with the ShapeIoU bounding-box loss function enhances the model's ability to localize objects and speeds up its convergence. Evaluated on a self-compiled water-meter image dataset, GMS-YOLO attained a mAP@0.5 of 92.4% and a precision of 93.2%, improvements of 2.0% and 2.1% over YOLOv8n, respectively. Despite the increased computational burden, GMS-YOLO maintains an average detection time of 10 ms per image, meeting practical detection needs.
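As a rough illustration of the grouped multi-scale idea, the input channels can be split into groups that each pass through a convolution with a different kernel size, giving the block several receptive fields at once. This is a sketch of the general pattern, not the paper's exact GMSC block; the channel split and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class GroupedMultiScaleConv(nn.Module):
    """Split channels into groups, convolve each group with a different kernel size."""
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        group = channels // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(group, group, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        chunks = torch.chunk(x, len(self.branches), dim=1)   # one chunk per kernel size
        out = [branch(c) for branch, c in zip(self.branches, chunks)]
        return torch.cat(out, dim=1)                          # restore original width

x = torch.randn(1, 64, 32, 32)
y = GroupedMultiScaleConv(64)(x)   # shape preserved: (1, 64, 32, 32)
```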
{"title":"GMS-YOLO: an enhanced algorithm for water meter reading recognition in complex environments","authors":"Yu Wang, Xiaodong Xiang","doi":"10.1007/s11554-024-01551-4","DOIUrl":"https://doi.org/10.1007/s11554-024-01551-4","url":null,"abstract":"<p>The disordered arrangement of water-meter pipes and the random rotation angles of their mechanical character wheels frequently result in captured water-meter images exhibiting tilt, blur, and incomplete characters. These issues complicate the detection of water-meter images, rendering traditional OCR (optical character recognition) methods inadequate for current detection requirements. Furthermore, the two-stage detection method, which involves first locating and then recognizing, proves overly cumbersome. In this paper, water-meter reading recognition is approached as an object-detection task, extracting readings using the algorithm’s Predicted Box information, establishing a water-meter dataset, and refining the algorithmic framework to improve the accuracy of recognizing incomplete characters. Utilizing YOLOv8n as the baseline, we propose GMS-YOLO, a novel object-detection algorithm that employs Grouped Multi-Scale Convolution for enhanced performance. First, by substituting the Bottleneck module’s convolution with GMSC (Grouped Multi-Scale Convolution), the model can access various scale receptive fields, thus boosting its feature-extraction prowess. Second, incorporating LSKA (Large Kernel Separable Attention) into the SPPF (Spatial Pyramid Pooling Fast) module improves the perception of fine-grained features. Finally, replacing CIoU (Generalized Intersection over Union) with the ShapeIoU bounding box loss function enhances the model’s ability to localize objects and speeds up its convergence. Evaluating a self-compiled water-meter image dataset, GMS-YOLO attained a mAP@0.5 of 92.4% and a precision of 93.2%, marking a 2.0% and 2.1% enhancement over YOLOv8n, respectively. Despite the increased computational burden, GMS-YOLO maintains an average detection time of 10 ms per image, meeting practical detection needs.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"9 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast rough mode decision algorithm and hardware architecture design for AV1 encoder
Pub Date : 2024-09-12 DOI: 10.1007/s11554-024-01552-3
Heng Chen, Xiaofeng Huang, Zehao Tao, Qinghua Sheng, Yan Cui, Yang Zhou, Haibing Yin
To enhance compression efficiency, the AV1 video coding standard introduces several new intra-prediction modes, such as smooth and finer directional prediction modes. However, this addition increases computational complexity and hinders parallelized hardware implementation. In this paper, a hardware-friendly rough mode decision (RMD) algorithm and its fully pipelined hardware architecture are proposed to address these challenges. For algorithm optimization, a novel directional mode pruning algorithm is proposed first. Then, an accumulated approximation of the sum of absolute transformed differences (SATD) cost is adopted during the tree search. Finally, in the reconstruction stage, a reconstruction approximation model based on the DC transform is proposed to solve the low-parallelism problem. The proposed fully pipelined hardware architecture is implemented with 28 pipeline stages and can process multiple prediction modes in parallel. Experimental results show that the proposed fast algorithm achieves 46.8% time savings at the cost of a 1.96% Bjøntegaard delta rate (BD-Rate) increase on average under the all-intra (AI) configuration. When synthesized with 28 nm UMC technology, the proposed hardware operates at 316.2 MHz with a gate count of 1113.14 K.
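SATD is typically computed by applying a Hadamard transform to the prediction residual and summing the absolute transform coefficients. The sketch below shows a plain 8x8 software version as a reference point; it is a generic illustration under that assumption, not the paper's hardware-oriented accumulated approximation.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Unnormalized Hadamard matrix of size n (n must be a power of two)."""
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def satd(orig: np.ndarray, pred: np.ndarray) -> float:
    """Sum of absolute transformed differences for one square block."""
    resid = orig.astype(np.int64) - pred.astype(np.int64)
    h = hadamard(resid.shape[0])
    coeffs = h @ resid @ h.T          # 2-D Hadamard transform of the residual
    return float(np.abs(coeffs).sum())

orig = np.random.randint(0, 256, (8, 8))
pred = np.random.randint(0, 256, (8, 8))
print(satd(orig, pred))   # rough distortion estimate used to rank intra modes
```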
{"title":"Fast rough mode decision algorithm and hardware architecture design for AV1 encoder","authors":"Heng Chen, Xiaofeng Huang, Zehao Tao, Qinghua Sheng, Yan Cui, Yang Zhou, Haibing Yin","doi":"10.1007/s11554-024-01552-3","DOIUrl":"https://doi.org/10.1007/s11554-024-01552-3","url":null,"abstract":"<p>To enhance compression efficiency, the AV1 video coding standard has introduced several new intra-prediction modes, such as smooth and finer directional prediction modes. However, this addition increases computational complexity and hinders parallelized hardware implementation. In this paper, a hardware-friendly rough mode decision (RMD) algorithm and its fully pipelined hardware architecture design are proposed to address these challenges. For algorithm optimization, firstly, a novel directional mode pruning algorithm is proposed. Then, the sum of absolute transform differences (SATD) cost accumulated approximation method is adopted during the tree search. Finally, in the reconstruction stage, a reconstruction approximation model based on the DC transform is proposed to solve the low-parallelism problem. For hardware architecture design, the proposed fully pipelined hardware architecture is implemented with 28 pipeline stages. This design can process multiple prediction modes in parallel. Experimental results show that the proposed fast algorithm achieves 46.8% time savings by 1.96% Bjøntegaard delta rate (BD-Rate) increase on average under all-intra (AI) configuration. When synthesized under the 28nm UMC technology, the proposed hardware can operate at a frequency of 316.2 MHz with 1113.14 K gate count.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"62 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AdaptoMixNet: detection of foreign objects on power transmission lines under severe weather conditions
Pub Date : 2024-09-12 DOI: 10.1007/s11554-024-01546-1
Xinghai Jia, Chao Ji, Fan Zhang, Junpeng Liu, Mingjiang Gao, Xinbo Huang
As power transmission lines expand in scale, their surroundings become complex and susceptible to foreign objects, which severely threaten safe operation. Current algorithms lack stability and real-time performance for small-target detection and severe weather conditions. Therefore, this paper proposes AdaptoMixNet, a method for detecting foreign objects on power transmission lines under severe weather conditions. First, an Adaptive Fusion Module (AFM) is introduced, which improves the model's accuracy and adaptability through multi-scale feature extraction, fine-grained information preservation, and enhanced context information. Second, an Adaptive Feature Pyramid Module (AEFPM) is proposed, which enhances the focus on local details while preserving global information, improving the stability and robustness of feature representation. Finally, the Neuron Expansion Recursion Adaptive Filter (CARAFE) is designed, which enhances feature extraction, adaptive filtering, and recursive mechanisms, improving detection accuracy, robustness, and computational efficiency. Experimental results show that the proposed method exhibits excellent performance in detecting foreign objects on power transmission lines against complex backgrounds and under harsh weather conditions.
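Adaptive fusion of multi-scale features is commonly implemented with learnable, normalized per-input weights. The sketch below shows that generic pattern only; it is an assumption for illustration, since the paper's AFM is not specified at this level of detail, and the channel count and projection layer are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveWeightedFusion(nn.Module):
    """Fuse same-shaped feature maps with learnable, softmax-normalized weights."""
    def __init__(self, num_inputs: int, channels: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.proj = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, features):
        w = F.softmax(self.weights, dim=0)             # weights sum to 1
        fused = sum(wi * fi for wi, fi in zip(w, features))
        return self.proj(fused)

# Example: fuse three 128-channel maps already resized to a common resolution.
feats = [torch.randn(1, 128, 40, 40) for _ in range(3)]
fused = AdaptiveWeightedFusion(3, 128)(feats)          # (1, 128, 40, 40)
```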
{"title":"AdaptoMixNet: detection of foreign objects on power transmission lines under severe weather conditions","authors":"Xinghai Jia, Chao Ji, Fan Zhang, Junpeng Liu, Mingjiang Gao, Xinbo Huang","doi":"10.1007/s11554-024-01546-1","DOIUrl":"https://doi.org/10.1007/s11554-024-01546-1","url":null,"abstract":"<p>With the expansion of power transmission line scale, the surrounding environment is complex and susceptible to foreign objects, severely threatening its safe operation. The current algorithm lacks stability and real-time performance in small target detection and severe weather conditions. Therefore, this paper proposes a method for detecting foreign objects on power transmission lines under severe weather conditions based on AdaptoMixNet. First, an Adaptive Fusion Module (AFM) is introduced, which improves the model's accuracy and adaptability through multi-scale feature extraction, fine-grained information preservation, and enhancing context information. Second, an Adaptive Feature Pyramid Module (AEFPM) is proposed, which enhances the focus on local details while preserving global information, improving the stability and robustness of feature representation. Finally, the Neuron Expansion Recursion Adaptive Filter (CARAFE) is designed, which enhances feature extraction, adaptive filtering, and recursive mechanisms, improving detection accuracy, robustness, and computational efficiency. Experimental results show that the method of this paper exhibits excellent performance in the detection of foreign objects on power transmission lines under complex backgrounds and harsh weather conditions.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"19 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mfdd: Multi-scale attention fatigue and distracted driving detector based on facial features
Pub Date : 2024-09-11 DOI: 10.1007/s11554-024-01549-y
Yulin Shi, Jintao Cheng, Xingming Chen, Jiehao Luo, Xiaoyu Tang
With the rapid expansion of the automotive industry and the continuous growth of vehicle fleets, traffic safety has become a critical global issue. Detection and alert systems for fatigue and distracted driving are therefore essential for enhancing traffic safety. Factors such as variations in the driver's facial details, lighting conditions, and camera pixel quality significantly affect the accuracy of fatigue and distracted driving detection, often limiting the effectiveness of existing methods. This study introduces a new network designed to detect fatigue and distracted driving against the complex backgrounds typical of vehicle interiors. To extract driver and facial information as well as gradient details more efficiently, we introduce the Multihead Difference Kernel Convolution Module (MDKC) and the Multiscale Large Convolutional Fusion Module (MLCF) into the baseline, blending multihead mixed convolution with large and small convolutional kernels to amplify the spatial detail captured by the backbone. To extract gradient details from feature maps under different illumination and noise conditions, we enhance the network's neck with an Adaptive Convolutional Attention Module (ACAM), optimizing feature retention. Extensive comparative experiments validate the efficacy of our network, which shows superior performance on the Fatigue and Distracted Driving Dataset and competitive results on the public COCO dataset. Source code is available at https://github.com/SCNU-RISLAB/MFDD.
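One way convolutions of different sizes can expose gradient-like detail is to subtract a large-kernel (smoothed) response from a small-kernel response, in the spirit of a difference-of-Gaussians. The sketch below is an assumption-based illustration of that generic idea, not the paper's MDKC definition; kernel sizes and the fusion layer are mine.

```python
import torch
import torch.nn as nn

class DifferenceKernelConv(nn.Module):
    """Contrast small-kernel and large-kernel views of the same features."""
    def __init__(self, channels: int, small: int = 3, large: int = 7):
        super().__init__()
        self.small = nn.Conv2d(channels, channels, small, padding=small // 2, groups=channels)
        self.large = nn.Conv2d(channels, channels, large, padding=large // 2, groups=channels)
        self.mix = nn.Conv2d(2 * channels, channels, 1)   # fuse detail and context

    def forward(self, x):
        context = self.large(x)                # smoothed, wide receptive field
        detail = self.small(x) - context       # difference emphasizes edges/gradients
        return self.mix(torch.cat([detail, context], dim=1))

y = DifferenceKernelConv(64)(torch.randn(1, 64, 56, 56))   # (1, 64, 56, 56)
```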
{"title":"Mfdd: Multi-scale attention fatigue and distracted driving detector based on facial features","authors":"Yulin Shi, Jintao Cheng, Xingming Chen, Jiehao Luo, Xiaoyu Tang","doi":"10.1007/s11554-024-01549-y","DOIUrl":"https://doi.org/10.1007/s11554-024-01549-y","url":null,"abstract":"<p>With the rapid expansion of the automotive industry and the continuous growth of vehicle fleets, traffic safety has become a critical global social issue. Developing detection and alert systems for fatigue and distracted driving is essential for enhancing traffic safety. Factors, such as variations in the driver’s facial details, lighting conditions, and camera pixel quality, significantly affect the accuracy of fatigue and distracted driving detection, often resulting in the low effectiveness of existing methods. This study introduces a new network designed to detect fatigue and distracted driving amidst the complex backgrounds typical within vehicles. To extract driver and facial information as well as gradient details more efficiently, we introduce the Multihead Difference Kernel Convolution Module (MDKC) and Multiscale Large Convolutional Fusion Module (MLCF) in baseline. This incorporates a blend of Multihead Mixed Convolution and Large and Small Convolutional Kernels to amplify the spatial intricacies of the backbone. To extract gradient details from different illumination and noise feature maps, we enhance the network’s neck by introducing the Adaptive Convolutional Attention Module (ACAM) in NECK, optimizing feature retention. Extensive comparative experiments validate the efficacy of our network, showcasing superior performance not only on the Fatigue and Distracted Driving Dataset but also competitive results on the public COCO dataset. Source code is available at https://github.com/SCNU-RISLAB/MFDD.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"59 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrating YOLOv8 and CSPBottleneck based CNN for enhanced license plate character recognition
Pub Date : 2024-09-10 DOI: 10.1007/s11554-024-01537-2
Sahil Khokhar, Deepak Kedia
The paper introduces an integrated methodology for license plate character recognition, combining YOLOv8 for segmentation and a CSPBottleneck-based CNN classifier for character recognition. The proposed approach incorporates pre-processing techniques to enhance the recognition of partial plates and augmentation methods to address challenges arising from colour diversity. Performance analysis demonstrates YOLOv8’s high segmentation accuracy and fast processing time, complemented by precise character recognition and efficient processing by the CNN classifier. The integrated system achieves an overall accuracy of 99.02% with a total processing time of 9.9 ms, offering a robust solution for automated license plate recognition (ALPR) systems. The integrated approach presented in the paper holds promise for the practical implementation of ALPR technology and further development in the field of license plate recognition systems.
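One detail the abstract leaves implicit is how per-character detections become a plate string; a common post-processing step is to sort recognized characters by the horizontal position of their boxes. The hedged sketch below shows only that generic step; the box format and the single-row-plate assumption are mine, not details from the paper.

```python
import numpy as np

def plate_string(boxes: np.ndarray, labels) -> str:
    """Assemble a plate reading from per-character detections.

    boxes: (N, 4) array of [x1, y1, x2, y2]; labels: recognized class per box.
    Characters are ordered by the horizontal centre of each box, which
    assumes a single-row plate.
    """
    centres = (boxes[:, 0] + boxes[:, 2]) / 2.0
    order = np.argsort(centres)
    return "".join(labels[i] for i in order)

boxes = np.array([[120, 10, 150, 50], [20, 12, 50, 52], [70, 11, 100, 51]])
print(plate_string(boxes, ["C", "A", "B"]))   # -> "ABC"
```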
{"title":"Integrating YOLOv8 and CSPBottleneck based CNN for enhanced license plate character recognition","authors":"Sahil Khokhar, Deepak Kedia","doi":"10.1007/s11554-024-01537-2","DOIUrl":"https://doi.org/10.1007/s11554-024-01537-2","url":null,"abstract":"<p>The paper introduces an integrated methodology for license plate character recognition, combining YOLOv8 for segmentation and a CSPBottleneck-based CNN classifier for character recognition. The proposed approach incorporates pre-processing techniques to enhance the recognition of partial plates and augmentation methods to address challenges arising from colour diversity. Performance analysis demonstrates YOLOv8’s high segmentation accuracy and fast processing time, complemented by precise character recognition and efficient processing by the CNN classifier. The integrated system achieves an overall accuracy of 99.02% with a total processing time of 9.9 ms, offering a robust solution for automated license plate recognition (ALPR) systems. The integrated approach presented in the paper holds promise for the practical implementation of ALPR technology and further development in the field of license plate recognition systems.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"112 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A real-time visual SLAM based on semantic information and geometric information in dynamic environment
Pub Date : 2024-09-10 DOI: 10.1007/s11554-024-01527-4
Hongli Sun, Qingwu Fan, Huiqing Zhang, Jiajing Liu
Simultaneous Localization and Mapping (SLAM) is the core technology enabling mobile robots to autonomously explore and perceive the environment. However, dynamic objects in the scene significantly impact the accuracy and robustness of visual SLAM systems, limiting their applicability in real-world scenarios. Hence, we propose a real-time RGB-D visual SLAM algorithm designed for indoor dynamic scenes. Our approach includes a parallel lightweight object detection thread, which leverages the YOLOv7-tiny network to detect potential moving objects and generate 2D semantic information. Subsequently, a novel dynamic feature removal strategy is introduced in the tracking thread. This strategy integrates semantic information, geometric constraints, and feature point depth-based RANSAC to effectively mitigate the influence of dynamic features. To evaluate the effectiveness of the proposed algorithms, we conducted comparative experiments against other state-of-the-art algorithms on the TUM RGB-D dataset and the Bonn RGB-D dataset, as well as in real-world dynamic scenes. The results demonstrate that the algorithm maintains excellent accuracy and robustness in dynamic environments, while also exhibiting impressive real-time performance.
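The combination of semantic masking and geometric filtering can be sketched in a few lines: discard matches that fall inside detected dynamic-object boxes, then let RANSAC on the fundamental matrix reject the remaining outliers. This is a simplified sketch of the general idea under assumed box and point formats, not the paper's exact strategy (which also uses depth-based RANSAC).

```python
import numpy as np
import cv2

def filter_dynamic_matches(pts_prev, pts_curr, dynamic_boxes, thresh=1.0):
    """Reject feature matches that are likely on moving objects.

    Semantic step: drop matches whose current-frame point lies inside a
    detected dynamic-object box (x1, y1, x2, y2). Geometric step: fit a
    fundamental matrix with RANSAC and keep only its inliers. Needs at
    least eight surviving matches for the RANSAC step."""
    def in_dynamic_box(p):
        return any(x1 <= p[0] <= x2 and y1 <= p[1] <= y2
                   for x1, y1, x2, y2 in dynamic_boxes)

    keep = [i for i, p in enumerate(pts_curr) if not in_dynamic_box(p)]
    p1 = np.asarray(pts_prev, dtype=np.float32)[keep]
    p2 = np.asarray(pts_curr, dtype=np.float32)[keep]
    _, mask = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, thresh, 0.99)
    if mask is None:                 # too few points; fall back to semantic filter only
        return p1, p2
    mask = mask.ravel().astype(bool)
    return p1[mask], p2[mask]
```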
{"title":"A real-time visual SLAM based on semantic information and geometric information in dynamic environment","authors":"Hongli Sun, Qingwu Fan, Huiqing Zhang, Jiajing Liu","doi":"10.1007/s11554-024-01527-4","DOIUrl":"https://doi.org/10.1007/s11554-024-01527-4","url":null,"abstract":"<p>Simultaneous Localization and Mapping (SLAM) is the core technology enabling mobile robots to autonomously explore and perceive the environment. However, dynamic objects in the scene significantly impact the accuracy and robustness of visual SLAM systems, limiting its applicability in real-world scenarios. Hence, we propose a real-time RGB-D visual SLAM algorithm designed for indoor dynamic scenes. Our approach includes a parallel lightweight object detection thread, which leverages the YOLOv7-tiny network to detect potential moving objects and generate 2D semantic information. Subsequently, a novel dynamic feature removal strategy is introduced in the tracking thread. This strategy integrates semantic information, geometric constraints, and feature point depth-based RANSAC to effectively mitigate the influence of dynamic features. To evaluate the effectiveness of the proposed algorithms, we conducted comparative experiments using other state-of-the-art algorithms on the TUM RGB-D dataset and Bonn RGB-D dataset, as well as in real-world dynamic scenes. The results demonstrate that the algorithm maintains excellent accuracy and robustness in dynamic environments, while also exhibiting impressive real-time performance.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"46 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LGFF-YOLO: small object detection method of UAV images based on efficient local–global feature fusion
Pub Date : 2024-09-06 DOI: 10.1007/s11554-024-01550-5
Hongxing Peng, Haopei Xie, Huanai Liu, Xianlu Guan
Images captured by Unmanned Aerial Vehicles (UAVs) play a significant role in many fields. However, with the development of UAV technology, challenges such as detecting small and dense objects against complex backgrounds have emerged. In this paper, we propose LGFF-YOLO, a detection model that integrates a novel local–global feature fusion method with the YOLOv8 baseline, specifically designed for small object detection in UAV imagery. Our approach employs the Global Information Fusion Module (GIFM) and the Four-Leaf Clover Fusion Module (FLCM) to enhance the fusion of multi-scale features, improving detection accuracy without increasing model complexity. We further propose the RFA-Block and LDyHead to control the total number of model parameters and improve the representation capability for small object detection. Experimental results on the VisDrone2019 dataset demonstrate 38.3% mAP with only 4.15M parameters, a 4.5% increase over the YOLOv8 baseline, while achieving 79.1 FPS for real-time detection. These advancements enhance the model's generalization capability, balance accuracy and speed, and significantly extend its applicability for detecting small objects in UAV images.
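Global-local fusion is often realized by pooling a feature map into a global descriptor and using it to reweight local responses. The sketch below shows that generic squeeze-and-reweight pattern as an assumption for illustration; the paper's GIFM is not specified at this level of detail, and the layer choices here are mine.

```python
import torch
import torch.nn as nn

class GlobalContextFusion(nn.Module):
    """Inject a globally pooled descriptor back into every spatial location."""
    def __init__(self, channels: int):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # global descriptor per channel
        self.excite = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        g = self.excite(self.squeeze(x))                  # per-channel global weights
        return x * g + x                                  # reweight local features, keep residual

y = GlobalContextFusion(256)(torch.randn(1, 256, 40, 40))  # (1, 256, 40, 40)
```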
{"title":"LGFF-YOLO: small object detection method of UAV images based on efficient local–global feature fusion","authors":"Hongxing Peng, Haopei Xie, Huanai Liu, Xianlu Guan","doi":"10.1007/s11554-024-01550-5","DOIUrl":"https://doi.org/10.1007/s11554-024-01550-5","url":null,"abstract":"<p>Images captured by Unmanned Aerial Vehicles (UAVs) play a significant role in many fields. However, with the development of UAV technology, challenges such as detecting small and dense objects against complex backgrounds have emerged. In this paper, we propose LGFF-YOLO, a detection model that integrates a novel local–global feature fusion method with the YOLOv8 baseline, specifically designed for small object detection in UAV imagery. Our innovative approach employs the Global Information Fusion Module (GIFM) and the Four-Leaf Clover Fusion Module (FLCM) to enhance the fusion of multi-scale features, improving detection accuracy without increasing model complexity. Next, we proposed the RFA-Block and LDyHead to control the total number of model parameters and improve the representation capability for small object detection. Experimental results on the VisDrone2019 dataset demonstrate a 38.3% mAP with only 4.15M parameters, a 4. 5% increase over baseline YOLOv8, while achieving 79.1 FPS for real-time detection. These advancements enhance the model’s generalization capability, balancing accuracy and speed, and significantly extend its applicability for detecting small objects in UAV images.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"29 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A real-time foreign object detection method based on deep learning in complex open railway environments
Pub Date : 2024-09-06 DOI: 10.1007/s11554-024-01548-z
Binlin Zhang, Qing Yang, Fengkui Chen, Dexin Gao
In response to the many background interference factors and the low detection accuracy encountered in open railway foreign object detection, a real-time foreign object detection method based on deep learning for open railways in complex environments is proposed. First, images of foreign objects intruding into the clearance, collected by locomotives during long-term operation, are used to create a railway foreign object dataset that reflects current conditions. Then, to improve the performance of the target detection algorithm, several improvements are made to the YOLOv7-tiny network structure, enhancing feature extraction capability and strengthening detection performance. By introducing SimAM, a simple, parameter-free attention module for convolutional neural networks, the representation ability of ConvNets is improved without adding extra parameters. Additionally, drawing on the structure of the weighted Bi-directional Feature Pyramid Network (BiFPN), the backbone achieves cross-level feature fusion through added edges and neck fusion. The feature fusion layer is further improved by introducing the GhostNetV2 module, which enhances the fusion of features at different scales and greatly reduces computational load. Furthermore, the original loss function is replaced with the Normalized Wasserstein Distance (NWD) loss function to enhance the recognition of small, distant targets. Finally, the proposed algorithm is trained, validated, and compared with other mainstream detection algorithms on the established railway foreign object dataset. Experimental results show that the proposed algorithm runs in real time on embedded devices with high accuracy and improved model performance, providing precise data support for railway safety assurance.
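The NWD similarity models each box as a 2-D Gaussian and maps the Wasserstein distance between the two Gaussians into (0, 1] with an exponential, which degrades more gracefully than IoU for tiny boxes. The sketch below follows the commonly published formulation as a minimal reference, assuming a dataset-dependent constant; it is not claimed to be this paper's exact loss implementation.

```python
import math

def nwd(box_a, box_b, c: float = 12.8) -> float:
    """Normalized Wasserstein distance between two boxes given as (cx, cy, w, h).

    The squared 2-Wasserstein distance between the boxes' Gaussian models
    reduces to a squared Euclidean distance on (cx, cy, w/2, h/2); the
    constant c is dataset-dependent (the value here is an assumption)."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_sq = (cxa - cxb) ** 2 + (cya - cyb) ** 2 + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2
    return math.exp(-math.sqrt(w2_sq) / c)

print(nwd((50, 50, 10, 10), (52, 51, 12, 9)))   # close small boxes -> value near 1
```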
{"title":"A real-time foreign object detection method based on deep learning in complex open railway environments","authors":"Binlin Zhang, Qing Yang, Fengkui Chen, Dexin Gao","doi":"10.1007/s11554-024-01548-z","DOIUrl":"https://doi.org/10.1007/s11554-024-01548-z","url":null,"abstract":"<p>In response to the current challenges of numerous background influencing factors and low detection accuracy in the open railway foreign object detection, a real-time foreign object detection method based on deep learning for open railways in complex environments is proposed. Firstly, the images of foreign objects invading the clearance collected by locomotives during long-term operation are used to create a railway foreign object dataset that fits the current situation. Then, to improve the performance of the target detection algorithm, certain improvements are made to the YOLOv7-tiny network structure. The improved algorithm enhances feature extraction capability and strengthens detection performance. By introducing a Simple, parameter-free Attention Module for convolutional neural network (SimAM) attention mechanism, the representation ability of ConvNets is improved without adding extra parameters. Additionally, drawing on the network structure of the weighted Bi-directional Feature Pyramid Network (BiFPN), the backbone network achieves cross-level feature fusion by adding edges and neck fusion. Subsequently, the feature fusion layer is improved by introducing the GhostNetV2 module, which enhances the fusion capability of different scale features and greatly reduces computational load. Furthermore, the original loss function is replaced with the Normalized Wasserstein Distance (NWD) loss function to enhance the recognition capability of small distant targets. Finally, the proposed algorithm is trained and validated, and compared with other mainstream detection algorithms based on the established railway foreign object dataset. Experimental results show that the proposed algorithm achieves applicability and real-time performance on embedded devices, with high accuracy, improved model performance, and provides precise data support for railway safety assurance.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"7 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient and real-time skin lesion image segmentation using spatial-frequency information and channel convolutional networks
Pub Date : 2024-09-03 DOI: 10.1007/s11554-024-01542-5
Shangwang Liu, Bingyan Zhou, Yinghai Lin, Peixia Wang
Accurate segmentation of skin lesions in dermoscopy images is essential for physician screening. However, existing segmentation methods commonly face three main limitations: difficulty in accurately processing targets with coarse edges, frequent challenges in recovering detailed feature data, and a lack of adequate capability to fuse multi-scale features effectively. To overcome these problems, we propose a skin lesion segmentation network (SFCC Net) that combines an attention mechanism with a redundancy-reduction strategy. The first step is the design of a downsampling encoder and an encoder composed of Receptive Field (REFC) Blocks, aimed at supplementing lost details and extracting latent features. Subsequently, the Spatial-Frequency-Channel (SF) Block is employed to minimize feature redundancy and restore fine-grained information. To fully leverage previously learned features, an Up-sampling Convolution (UpC) Block is designed for information integration. The network's performance was compared with state-of-the-art models on four public datasets, and the results demonstrate significant improvements. On the ISIC datasets, the proposed network outperformed D-LKA Net by 4.19%, 0.19%, and 7.75% in F1, and by 2.14%, 0.51%, and 12.20% in IoU. The frame rate (FPS) of the proposed network when processing skin lesion images underscores its suitability for real-time image analysis. Additionally, the network's generalization capability was validated on a lung dataset.
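The F1 (Dice) and IoU figures above are the standard overlap metrics for binary segmentation masks. A minimal reference computation looks like the sketch below; it illustrates the metrics only and is not tied to the paper's evaluation code.

```python
import numpy as np

def f1_and_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """F1 (Dice) and IoU for a pair of binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    f1 = 2 * inter / (pred.sum() + gt.sum() + eps)
    iou = inter / (np.logical_or(pred, gt).sum() + eps)
    return f1, iou

pred = np.zeros((64, 64), dtype=np.uint8); pred[16:48, 16:48] = 1
gt = np.zeros((64, 64), dtype=np.uint8); gt[20:52, 20:52] = 1
print(f1_and_iou(pred, gt))   # overlapping squares -> F1 and IoU well below 1.0
```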
{"title":"Efficient and real-time skin lesion image segmentation using spatial-frequency information and channel convolutional networks","authors":"Shangwang Liu, Bingyan Zhou, Yinghai Lin, Peixia Wang","doi":"10.1007/s11554-024-01542-5","DOIUrl":"https://doi.org/10.1007/s11554-024-01542-5","url":null,"abstract":"<p>Accurate segmentation of skin lesions is essential for physicians to screen in dermoscopy images. However, they commonly face three main limitations: difficulty in accurately processing targets with coarse edges; frequent challenges in recovering detailed feature data; and a lack of adequate capability for the effective amalgamation of multi-scale features. To overcome these problems, we propose a skin lesion segmentation network (SFCC Net) that combines an attention mechanism and a redundancy reduction strategy. The initial step involved the design of a downsampling encoder and an encoder composed of Receptive Field (REFC) Blocks, aimed at supplementing lost details and extracting latent features. Subsequently, the Spatial-Frequency-Channel (SF) Block was employed to minimize feature redundancy and restore fine-grained information. To fully leverage previously learned features, an Up-sampling Convolution (UpC) Block was designed for information integration. The network’s performance was compared with state-of-the-art models on four public datasets. Experimental results demonstrate significant improvements in the network’s performance. On the ISIC datasets, the proposed network outperformed D-LKA Net by 4.19%, 0.19%, and 7.75% in F1, and by 2.14%, 0.51%, and 12.20% in IoU. The frame rate (FPS) of the proposed network when processing skin lesion images underscores its suitability for real-time image analysis. Additionally, the network’s generalization capability was validated on a lung dataset.\u0000</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"60 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}