Pub Date: 2024-08-15, DOI: 10.1007/s11554-024-01526-5
Xin Li, Changhai Ru, Haonan Sun
Real-time visual image prediction, crucial for directing robotic arm movements, represents a significant technique in artificial intelligence and robotics. The primary technical challenges involve the robot's inaccurate perception and understanding of the environment, coupled with imprecise control of movements. This study proposes ForGAN-MCTS, a generative adversarial network-based action sequence prediction algorithm, aimed at refining visually guided rearrangement planning for movable objects. First, the algorithm introduces a scalable and robust rearrangement-planning strategy built on Monte Carlo Tree Search. Second, to enable the robot to execute grasping maneuvers successfully, it provides a generative adversarial network-based real-time prediction method that uses a network trained solely on synthetic data to robustly estimate multi-object workspace states from a single uncalibrated RGB camera. The efficacy of the proposed algorithm is corroborated through extensive experiments conducted with a UR-5 robotic arm. The experimental results demonstrate that the algorithm surpasses existing methods in planning efficacy and processing speed. Additionally, the algorithm is robust to camera motion and effectively mitigates the effects of external perturbations.
{"title":"Adversarial generative learning and timed path optimization for real-time visual image prediction to guide robot arm movements","authors":"Xin Li, Changhai Ru, Haonan Sun","doi":"10.1007/s11554-024-01526-5","DOIUrl":"https://doi.org/10.1007/s11554-024-01526-5","url":null,"abstract":"<p>Real-time visual image prediction, crucial for directing robotic arm movements, represents a significant technique in artificial intelligence and robotics. The primary technical challenges involve the robot’s inaccurate perception and understanding of the environment, coupled with imprecise control of movements. This study proposes ForGAN-MCTS, a generative adversarial network-based action sequence prediction algorithm, aimed at refining visually guided rearrangement planning for movable objects. Initially, the algorithm unveils a scalable and robust strategy for rearrangement planning, capitalizing on the capabilities of a Monte Carlo Tree Search strategy. Secondly, to enable the robot’s successful execution of grasping maneuvers, the algorithm proposes a generative adversarial network-based real-time prediction method, employing a network trained solely on synthetic data for robust estimation of multi-object workspace states via a single uncalibrated RGB camera. The efficacy of the newly proposed algorithm is corroborated through extensive experiments conducted by using a UR-5 robotic arm. The experimental results demonstrate that the algorithm surpasses existing methods in terms of planning efficacy and processing speed. Additionally, the algorithm is robust to camera motion and can effectively mitigate the effects of external perturbations.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"7 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-13, DOI: 10.1007/s11554-024-01531-8
Hyeonbeen Lee, Jangho Lee
Crowd counting, the task of estimating the total number of people in an image, is essential for intelligent surveillance. Integrating a well-trained crowd counting network into edge devices, such as intelligent CCTV systems, enables its application across various domains, including the prevention of crowd collapses and urban planning. For a model to be embedded in edge devices, it requires robust performance, a reduced parameter count, and fast response times. This study proposes a lightweight yet powerful model called TinyCount, which has only 60k parameters. The proposed TinyCount is a fully convolutional network consisting of a feature extract module (FEM) for robust and rapid feature extraction, a scale perception module (SPM) for perceiving scale variation, and an upsampling module (UM) that restores the feature map to the size of the original image. TinyCount demonstrated competitive performance across three representative crowd counting datasets, despite using approximately 3.33 to 271 times fewer parameters than other crowd counting approaches. The proposed model achieves relatively fast inference by building on the MobileNetV2 architecture with dilated and transposed convolutions. The use of the SE block, together with findings from existing studies, further proved the design's effectiveness. Finally, we evaluated the proposed TinyCount on multiple edge devices, including the Raspberry Pi 4, NVIDIA Jetson Nano, and NVIDIA Jetson AGX Xavier, to demonstrate its potential for practical applications.
{"title":"TinyCount: an efficient crowd counting network for intelligent surveillance","authors":"Hyeonbeen Lee, Jangho Lee","doi":"10.1007/s11554-024-01531-8","DOIUrl":"https://doi.org/10.1007/s11554-024-01531-8","url":null,"abstract":"<p>Crowd counting, the task of estimating the total number of people in an image, is essential for intelligent surveillance. Integrating a well-trained crowd counting network into edge devices, such as intelligent CCTV systems, enables its application across various domains, including the prevention of crowd collapses and urban planning. For a model to be embedded in edge devices, it requires robust performance, reduced parameter count, and faster response times. This study proposes a lightweight and powerful model called TinyCount, which has only 60<i>k</i> parameters. The proposed TinyCount is a fully convolutional network consisting of a feature extract module (FEM) for robust and rapid feature extraction, a scale perception module (SPM) for scale variation perception and an upsampling module (UM) that adjusts the feature map to the same size as the original image. TinyCount demonstrated competitive performance across three representative crowd counting datasets, despite utilizing approximately 3.33 to 271 times fewer parameters than other crowd counting approaches. The proposed model achieved relatively fast inference times by leveraging the MobileNetV2 architecture with dilated and transposed convolutions. The application of SEblock and findings from existing studies further proved its effectiveness. Finally, we evaluated the proposed TinyCount on multiple edge devices, including the Raspberry Pi 4, NVIDIA Jetson Nano, and NVIDIA Jetson AGX Xavier, to demonstrate its potential for practical applications.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"9 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-12, DOI: 10.1007/s11554-024-01534-5
Wanchun Ren, Pengcheng Zhu, Shaofeng Cai, Yi Huang, Haoran Zhao, Youji Hama, Zhu Yan, Tao Zhou, Junde Pu, Hongwei Yang
As the mainstream chip packaging technology, plastic-encapsulated chips (PEC) suffer from process defects such as delamination and voids, which seriously impact chip reliability. Therefore, it is urgent to detect these defects promptly and accurately. However, current manual detection methods cannot meet application requirements, as they are both inaccurate and inefficient. This study utilized deep convolutional neural network (DCNN) techniques to analyze scanning acoustic microscope (SAM) images of PEC and identify their internal defects. First, SAM was used to collect and build datasets covering seven typical PEC defects. Then, to handle densely packed chips that occupy an extremely small proportion of each SAM image, a PECNet network was established to detect PEC, building on the traditional RetinaNet framework and combining a CoTNet50 backbone with a feature pyramid network. Furthermore, a PEDNet was designed to classify PEC defects based on the MobileNetV2 network, integrating cross-local connections and progressive classifiers. The experimental results demonstrated that PECNet reaches a chip recognition accuracy of 98.6% and processes a single image in only nine milliseconds. Meanwhile, PEDNet's average defect classification accuracy is 97.8%, and it recognizes a single image in only 0.0021 s. This method provides a precise and efficient technique for defect detection in PEC.
{"title":"Automatic detection of defects in electronic plastic packaging using deep convolutional neural networks","authors":"Wanchun Ren, Pengcheng Zhu, Shaofeng Cai, Yi Huang, Haoran Zhao, Youji Hama, Zhu Yan, Tao Zhou, Junde Pu, Hongwei Yang","doi":"10.1007/s11554-024-01534-5","DOIUrl":"https://doi.org/10.1007/s11554-024-01534-5","url":null,"abstract":"<p>As the mainstream chip packaging technology, plastic-encapsulated chips (PEC) suffer from process defects such as delamination and voids, which seriously impact the chip's reliability. Therefore, it is urgent to detect defects promptly and accurately. However, the current manual detection methods cannot meet the application's requirements, as they are both inaccurate and inefficient. This study utilized the deep convolutional neural network (DCNN) technique to analyze PEC's scanning acoustic microscope (SAM) images and identify their internal defects. First, the SAM technology was used to collect and set up datasets of seven typical PEC defects. Then, according to the characteristics of densely packed PEC and an incredibly tiny size ratio in SAM, a PECNet network was established to detect PEC based on the traditional RetinaNet network, combining the CoTNet50 backbone network and the feature pyramid network structure. Furthermore, a PEDNet was designed to classify PEC defects based on the MobileNetV2 network, integrating cross-local connections and progressive classifiers. The experimental results demonstrated that the PECNet network's chip recognition accuracy reaches 98.6%, and its speed of a single image requires only nine milliseconds. Meanwhile, the PEDNet network's average defect classification accuracy is 97.8%, and the recognition speed of a single image is only 0.0021 s. This method provides a precise and efficient technique for defect detection in PEC.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"79 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-10, DOI: 10.1007/s11554-024-01533-6
Li Liu, Kaiye Huang, Yuang Bai, Qifan Zhang, Yujian Li
Aiming at the issue that existing aerial-work safety belt wearing detection models cannot run in real time on edge devices, this paper proposes a lightweight aerial-work safety belt detection model with higher accuracy. First, the model is made lightweight by introducing Ghost convolution and model pruning. Second, for complex scenarios involving occlusion, color confusion, and the like, the model's performance is optimized by introducing a new up-sampling operator, an attention mechanism, and a feature fusion network. Lastly, the model is trained using knowledge distillation to compensate for the accuracy loss resulting from the lightweight design, thereby maintaining higher accuracy. Experimental results on the Guangdong Power Grid Intelligence Challenge safety belt wearing dataset show that, in the comparison experiments, the improved model has only 8.7% of the parameters of the mainstream object detector YOU ONLY LOOK ONCE v5s (YOLOv5s), differs in mean Average Precision (mAP.50) by only 3.7%, and is 100.4% faster. Meanwhile, the ablation experiments show that the improved model's parameter count is reduced by 66.9% compared with the original model, while mAP.50 decreases by only 1.9%. The aerial-work safety belt detection model proposed in this paper combines a lightweight design, the SimAM attention mechanism, a Bidirectional Feature Pyramid Network for feature fusion, the Carafe operator, and a knowledge distillation training strategy, enabling the model to remain lightweight and real-time while achieving high detection accuracy.
{"title":"Real-time detection model of electrical work safety belt based on lightweight improved YOLOv5","authors":"Li Liu, Kaiye Huang, Yuang Bai, Qifan Zhang, Yujian Li","doi":"10.1007/s11554-024-01533-6","DOIUrl":"https://doi.org/10.1007/s11554-024-01533-6","url":null,"abstract":"<p>Aiming at the issue that the existing aerial work safety belt wearing detection model cannot meet the real-time operation on edge devices, this paper proposes a lightweight aerial work safety belt detection model with higher accuracy. First, the model is made lightweight by introducing Ghost convolution and model pruning. Second, for complex scenarios involving occlusion, color confusion, etc., the model’s performance is optimized by introducing the new up-sampling operator, the attention mechanism, and the feature fusion network. Lastly, the model is trained using knowledge distillation to compensate for accuracy loss resulting from the lightweight design, thereby maintain a higher accuracy. Experimental results based on the Guangdong Power Grid Intelligence Challenge safety belt wearable dataset show that, in the comparison experiments, the improved model, compared with the mainstream object detection algorithm YOU ONLY LOOK ONCE v5s (YOLOv5s), has only 8.7% of the parameters of the former with only 3.7% difference in the mean Average Precision (mAP.50) metrics and the speed is improved by 100.4%. Meanwhile, the ablation experiments show that the improved model’s parameter count is reduced by 66.9% compared with the original model, while mAP.50 decreases by only 1.9%. The overhead safety belt detection model proposed in this paper combines the model’s lightweight design, SimAM attention mechanism, Bidirectional Feature Pyramid Network feature fusion network, Carafe operator, and knowledge distillation training strategy, enabling the model to maintain lightweight and real-time performance while achieving high detection accuracy.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"7 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-09, DOI: 10.1007/s11554-024-01528-3
Xin Zhao, Lianping Yang, Wencong Huang, Qi Wang, Xin Wang, Yantao Lou
Human pose estimation using RGB cameras often encounters performance degradation in challenging scenarios such as motion blur or suboptimal lighting. In comparison, event cameras, endowed with a wide dynamic range, microsecond-scale temporal resolution, minimal latency, and low power consumption, demonstrate remarkable adaptability in extreme visual environments. Nevertheless, current research exploiting event cameras for pose estimation has not yet fully harnessed the potential of event-driven data, and improving model efficiency remains an ongoing pursuit. This work focuses on devising an efficient, compact pose estimation algorithm, with special attention to optimizing the fusion of multi-view event streams for improved pose prediction accuracy. We propose EV-TIFNet, a compact dual-view interactive network, which incorporates event frames along with our custom-designed Global Spatio-Temporal Feature Maps (GTF Maps). To enhance the network's ability to understand motion characteristics and localize keypoints, we have tailored a dedicated Auxiliary Information Extraction Module (AIE Module) for the GTF Maps. Experimental results demonstrate that our model, with a compact parameter count of 0.55 million, achieves notable advancements on the DHP19 dataset, reducing the 3D MPJPE to 61.45 mm. Building upon the sparsity of event data, the integration of sparse convolution operators replaces a significant portion of traditional convolutional layers, reducing computational demand by 28.3% to a total of 8.71 GFLOPs. These design choices highlight the model's suitability and efficiency in scenarios where computational resources are limited.
Title: EV-TIFNet: lightweight binocular fusion network assisted by event camera time information for 3D human pose estimation
Pub Date: 2024-08-08, DOI: 10.1007/s11554-024-01524-7
Bingquan Wang, Fangling Yang
With the rapid development of artificial intelligence and Big Data, artificial intelligence-generated image content (AIGIC) is being applied ever more widely across fields. However, the image data used by AIGIC is diverse, often contains sensitive personal information, and is characterized by heterogeneity and privacy concerns. This leads to long implementation times for image data privacy protection and a high risk of unauthorized third-party access, resulting in serious privacy breaches and security risks. To address this issue, this paper combines Hierarchical Federated Learning (HFL) with homomorphic encryption to tackle the encryption and transmission challenges in the AIGIC image processing pipeline. Building on this foundation, a novel HFL group collaborative training strategy is designed to further streamline the privacy protection process for AIGIC image data, effectively masking the heterogeneity of raw image data and balancing the allocation of computational resources. Additionally, a pruning-based model compression algorithm is introduced to relieve the data transmission pressure in the image encryption process. Optimizing the modulo operations of the homomorphic encryption scheme significantly reduces the computational burden, enabling real-time improvements to image data privacy protection along multiple dimensions, including computational and transmission resources. To verify the effectiveness of the proposed mechanism, extensive simulations of the lightweight privacy protection process for AIGIC image data were performed, and the time complexity of the mechanism was analyzed comparatively. Experimental results indicate substantial advantages of the proposed algorithm over traditional real-time privacy protection algorithms for AIGIC.
{"title":"Lightweight and privacy-preserving hierarchical federated learning mechanism for artificial intelligence-generated image content","authors":"Bingquan Wang, Fangling Yang","doi":"10.1007/s11554-024-01524-7","DOIUrl":"https://doi.org/10.1007/s11554-024-01524-7","url":null,"abstract":"<p>With the rapid development of artificial intelligence and Big Data, the application of artificial intelligence-generated image content (AIGIC) is becoming increasingly widespread in various fields. However, the image data utilized by AIGIC is diverse and often contains sensitive personal information, characterized by heterogeneity and privacy concerns. This leads to prolonged implementation times for image data privacy protection, and a high risk of unauthorized third-party access, resulting in serious privacy breaches and security risks. To address this issue, this paper combines Hierarchical Federated Learning (HFL) with Homomorphic Encryption to first address the encryption and transmission challenges in the image processing pipeline of AIGIC. Building upon this foundation, a novel HFL group collaborative training strategy is designed to further streamline the privacy protection process of AIGIC image data, effectively masking the heterogeneity of raw image data and achieving balanced allocation of computational resources. Additionally, a model compression algorithm based on pruning is introduced to alleviate the data transmission pressure in the image encryption process. Optimization of the homomorphic encryption modulo operations significantly reduces the computational burden, enabling real-time enhancement of image data privacy protection from multiple dimensions including computational and transmission resources. To verify the effectiveness of the proposed mechanism, extensive simulation verification of the lightweight privacy protection process for AIGIC image data was performed, and a comparative analysis of the time complexity of the mechanism was conducted. Experimental results indicate substantial advantages of the proposed algorithm over traditional real-time privacy protection algorithms in AIGIC.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"4 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-08, DOI: 10.1007/s11554-024-01529-2
Shuren Zhou, Shengzhen Long
Wheat is one of the most significant crops in China, as its yield directly affects the country's food security. Owing to their dense, overlapping, and relatively fuzzy distribution, wheat spikes are prone to being missed in practical detection. Existing object detection models suffer from large model size, high computational complexity, and long computation times. Consequently, this study proposes a lightweight real-time wheat spike detection model called YOLO-LF. First, a lightweight backbone network is improved to reduce the model size and the number of parameters, thereby improving runtime speed. Second, the neck is redesigned around the wheat spike dataset to enhance the network's feature extraction capability for wheat spikes while keeping it lightweight. Finally, a lightweight detection head is designed to significantly reduce the model's FLOPs and push the lightweight design further. Experimental results on the test set indicate that our model is 1.7 MB in size, has 0.76 M parameters, and requires 2.9 GFLOPs, which represent reductions of 73, 74, and 64% compared to YOLOv8n, respectively. Our model demonstrates a latency of 8.6 ms and 115 FPS on a Titan X, whereas YOLOv8n has a latency of 10.2 ms and 97 FPS on the same hardware. Our model is thus lighter and faster, while mAP@0.5 decreases by only 0.9%, outperforming YOLOv8 and other mainstream detection networks in overall performance. Consequently, our model can be deployed on mobile devices to provide effective assistance in the real-time detection of wheat spikes.
{"title":"YOLO-LF: a lightweight multi-scale feature fusion algorithm for wheat spike detection","authors":"Shuren Zhou, Shengzhen Long","doi":"10.1007/s11554-024-01529-2","DOIUrl":"https://doi.org/10.1007/s11554-024-01529-2","url":null,"abstract":"<p>Wheat is one of the most significant crops in China, as its yield directly affects the country’s food security. Due to its dense, overlapping, and relatively fuzzy distribution, wheat spikes are prone to being missed in practical detection. Existing object detection models suffer from large model size, high computational complexity, and long computation times. Consequently, this study proposes a lightweight real-time wheat spike detection model called YOLO-LF. Initially, a lightweight backbone network is improved to reduce the model size and lower the number of parameters, thereby improving the runtime speed. Second, the structure of the neck is redesigned in the context of the wheat spike dataset to enhance the feature extraction capability of the network for wheat spikes and to achieve lightweightness. Finally, a lightweight detection head was designed to significantly reduce the FLOPs of the model and achieve further lightweighting. Experimental results on the test set indicate that the size of our model is 1.7 MB, the number of parameters is 0.76 M, and the FLOPs are 2.9, which represent reductions of 73, 74, and 64% compared to YOLOv8n, respectively. Our model demonstrates a latency of 8.6 ms and an FPS of 115 on Titan X, whereas YOLOv8n has a latency of 10.2 ms and an FPS of 97 on the same hardware. In contrast, our model is more lightweight and faster to detect, while the mAP@0.5 only decreases by 0.9%, outperforming YOLOv8 and other mainstream detection networks in overall performance. Consequently, our model can be deployed on mobile devices to provide effective assistance in the real-time detection of wheat spikes.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"15 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-05, DOI: 10.1007/s11554-024-01521-w
Zhishuai Zheng, Zhedong Ge, Zhikang Tian, Xiaoxia Yang, Yucheng Zhou
Current research on image classification has combined convolutional neural networks (CNNs) and transformers to introduce inductive biases into the model, enhancing its ability to handle long-range dependencies. However, these integrated models have limitations. Standard CNNs are static, so their convolutions cannot adjust dynamically to the input image, which limits feature expression; this static nature also impedes the seamless integration of features dynamically generated by self-attention with static features generated by convolution when the two are combined. Furthermore, each model stage contains abundant information that single-scale convolution cannot fully exploit, which ultimately hurts classification performance. To tackle these challenges, we propose WoodGLNet, a real-time multi-scale pyramid network that aggregates global and local information in an input-dependent manner and facilitates feature interaction through convolutions at three scales. WoodGLNet employs efficient multi-scale global spatial decay attention modules and input-dependent multi-scale dynamic convolutions at different stages, strengthening the network's inductive biases and expanding its effective receptive field. On the CIFAR100 and CIFAR10 image classification tasks, WoodGLNet-T achieves Top-1 accuracies of 76.34% and 92.35%, respectively, outperforming EfficientNet-B3 by 1.03 and 0.86 percentage points. WoodGLNet-S and WoodGLNet-B attain Top-1 accuracies of 77.56% and 93.66%, and 80.12% and 94.27%, respectively. The experimental samples were sourced from the Shandong Province Construction Structural Material Specimen Museum, whose wood-testing work demands high real-time performance. To assess WoodGLNet's real-time detection capabilities, 20 species of precious wood from the museum were identified in real time using the network. The results indicate that WoodGLNet achieves a classification accuracy of up to 99.60%, with a recognition time of 0.013 s per image. These findings demonstrate the network's strong real-time classification and generalization abilities.
Title: WoodGLNet: a multi-scale network integrating global and local information for real-time classification of wood images
Pub Date: 2024-08-05, DOI: 10.1007/s11554-024-01530-9
Chenghai Yu, Xiangwei Chen
Railway turnouts are critical components of the rail track system, and their defects can lead to severe safety incidents and significant property damage. The irregular distribution and varying sizes of railway-turnout defects, combined with changing environmental lighting and complex backgrounds, pose challenges for traditional detection methods, which often suffer from low accuracy and poor real-time performance. To improve the detection of railway-turnout defects, this study proposes a high-precision recognition model, Faster-Hilo-BiFPN-DETR (FHB-DETR), based on the RT-DETR architecture. First, we designed the Faster CGLU module based on the Faster Block, which optimizes the aggregation of local and global feature information through partial convolution and gating mechanisms; this reduces both computational load and parameter count while enhancing feature extraction. Second, we replaced multi-head self-attention with Hilo attention, further reducing the parameter count and computational load and improving real-time performance. For feature fusion, we used BiFPN instead of CCFF to better capture subtle defect features and optimized the contribution of each feature map through a weighting mechanism. Experimental results show that, compared to RT-DETR, FHB-DETR improves mAP50 by 3.5%, reduces the parameter count by 25%, and decreases computational complexity by 6%, while maintaining a high frame rate that meets real-time requirements.
{"title":"Railway rutting defects detection based on improved RT-DETR","authors":"Chenghai Yu, Xiangwei Chen","doi":"10.1007/s11554-024-01530-9","DOIUrl":"https://doi.org/10.1007/s11554-024-01530-9","url":null,"abstract":"<p>Railway turnouts are critical components of the rail track system, and their defects can lead to severe safety incidents and significant property damage. The irregular distribution and varying sizes of railway-turnout defects, combined with changing environmental lighting and complex backgrounds, pose challenges for traditional detection methods, often resulting in low accuracy and poor real-time performance. To address the issue of improving the detection performance of railway-turnout defects, this study proposes a high-precision recognition model, Faster-Hilo-BiFPN-DETR (FHB-DETR), based on the RT-DETR architecture. First, we designed the Faster CGLU module based on Faster Block, which optimizes the aggregation of local and global feature information through partial convolution and gating mechanisms. This approach reduces both computational load and parameter count while enhancing feature extraction capabilities. Second, we replaced the multi-head self-attention mechanism with Hilo attention, reducing parameter count and computational load, and improving real-time performance. In terms of feature fusion, we utilized BiFPN instead of CCFF to better capture subtle defect features and optimized the weighting of feature maps through a weighted mechanism. Experimental results show that compared to RT-DETR, FHB-DETR improved mAP50 by 3.5%, reduced parameter count by 25%, and decreased computational complexity by 6%, while maintaining a high frame rate, meeting real-time performance requirements.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"83 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-05, DOI: 10.1007/s11554-024-01525-6
Mehmet Erkin Yücel, Serkan Topaloğlu, Cem Ünsalan
The retail sector presents several open and challenging problems that could benefit from advanced pattern recognition and computer vision techniques. One such critical challenge is planogram compliance control. In this study, we propose a complete embedded system to tackle this issue. Our system consists of four key components: image acquisition and transfer via a stand-alone embedded camera module; object detection via computer vision and deep learning methods running on single-board computers; a planogram compliance control method, also running on single-board computers; and an energy harvesting and power management block accompanying the embedded camera modules. The image acquisition and transfer block is implemented on the ESP-EYE camera module. The object detection block is based on YOLOv5 as the deep learning method together with local feature extraction. We implement these methods on the Raspberry Pi 4, NVIDIA Jetson Orin Nano, and NVIDIA Jetson AGX Orin single-board computers. The planogram compliance control block performs sequence alignment with a modified Needleman–Wunsch algorithm and runs alongside the object detection block on the same single-board computers. The energy harvesting and power management block consists of solar and RF energy-harvesting modules with a suitable battery pack. We tested the proposed embedded planogram compliance control system on two different datasets to provide valuable insights into its strengths and weaknesses. The results show that the proposed method achieves F1 scores of 0.997 and 1.0 in the object detection and planogram compliance control blocks, respectively. Furthermore, we calculated that the complete embedded system can operate stand-alone for up to 2 years on battery, a duration that can be further extended by integrating the proposed solar and RF energy-harvesting options.
{"title":"Embedded planogram compliance control system","authors":"Mehmet Erkin Yücel, Serkan Topaloğlu, Cem Ünsalan","doi":"10.1007/s11554-024-01525-6","DOIUrl":"https://doi.org/10.1007/s11554-024-01525-6","url":null,"abstract":"<p>The retail sector presents several open and challenging problems that could benefit from advanced pattern recognition and computer vision techniques. One such critical challenge is planogram compliance control. In this study, we propose a complete embedded system to tackle this issue. Our system consists of four key components as image acquisition and transfer via stand-alone embedded camera module, object detection via computer vision and deep learning methods working on single-board computers, planogram compliance control method again working on single-board computers, and energy harvesting and power management block to accompany the embedded camera modules. The image acquisition and transfer block is implemented on the ESP-EYE camera module. The object detection block is based on YOLOv5 as the deep learning method and local feature extraction. We implement these methods on Raspberry Pi 4, NVIDIA Jetson Orin Nano, and NVIDIA Jetson AGX Orin as single-board computers. The planogram compliance control block utilizes sequence alignment through a modified Needleman–Wunsch algorithm. This block is also working along with the object detection block on the same single-board computers. The energy harvesting and power management block consists of solar and RF energy-harvesting modules with suitable battery pack for operation. We tested the proposed embedded planogram compliance control system on two different datasets to provide valuable insights on its strengths and weaknesses. The results show that the proposed method achieves F1 scores of 0.997 and 1.0 in object detection and planogram compliance control blocks, respectively. Furthermore, we calculated that the complete embedded system can work in stand-alone form up to 2 years based on battery. This duration can be further extended with the integration of the proposed solar and RF energy-harvesting options.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"58 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}