Pub Date: 2024-05-03. DOI: 10.1007/s11554-024-01462-4
Shuai Feng, Huaming Qian, Huilin Wang, Wenna Wang
Deep learning-based object detection methods often grapple with excessive model parameters, high complexity, and subpar real-time performance. In response, the YOLO series, particularly the YOLOv5s to YOLOv8s methods, has been developed to strike a balance between real-time processing and accuracy. Nevertheless, YOLOv8's precision can fall short in certain applications. To address this, we introduce a real-time object detection method called η-RepYOLO, which is built upon the η-RepConv structure. This method is designed to maintain consistent detection speeds while improving accuracy. We begin by crafting a backbone network named η-EfficientRep, which utilizes strategically designed network units, the η-RepConv and η-RepC2f modules, that are reparameterized to generate an efficient inference model. This model achieves superior performance by extracting detailed feature maps from images. Subsequently, we propose the enhanced η-RepPANet and η-RepAFPN as the model's detection neck, with the addition of the η-RepC2f for optimized feature fusion, thus boosting the neck's functionality. Our innovation continues with the development of an advanced decoupled detection head, where the η-RepConv takes the place of the traditional 3 × 3 conv, resulting in a marked increase in detection precision during the inference stage. Our proposed η-RepYOLO method, when applied with the distinct neck modules η-RepPANet and η-RepAFPN, achieves an mAP of 84.77%/85.65% on the PASCAL VOC07+12 dataset and an AP of 45.3%/45.8% on the MSCOCO dataset, respectively. These figures represent a significant advancement over the YOLOv8s method. Additionally, the model parameters of η-RepYOLO are reduced to 10.8M/8.8M, which is 3.6%/21.4% less than that of YOLOv8, culminating in a more streamlined detection model. The detection speeds clocked on an RTX 3060 are 116 FPS/81 FPS, a substantial enhancement compared to YOLOv8s. In summary, our approach delivers competitive performance and presents a more lightweight alternative to the SOTA YOLO models, making it a robust choice for real-time object detection applications.
{"title":"$$eta$$ -repyolo: real-time object detection method based on $$eta$$ -RepConv and YOLOv8","authors":"Shuai Feng, Huaming Qian, Huilin Wang, Wenna Wang","doi":"10.1007/s11554-024-01462-4","DOIUrl":"https://doi.org/10.1007/s11554-024-01462-4","url":null,"abstract":"<p>Deep learning-based object detection methods often grapple with excessive model parameters, high complexity, and subpar real-time performance. In response, the YOLO series, particularly the YOLOv5s to YOLOv8s methods, has been developed by scholars to strike a balance between real-time processing and accuracy. Nevertheless, YOLOv8’s precision can fall short in certain specific applications. To address this, we introduce a real-time object detection method called <span>(eta)</span>-RepYOLO, which is built upon the <span>(eta)</span>-RepConv structure. This method is designed to maintain consistent detection speeds while improving accuracy. We begin by crafting a backbone network named <span>(eta)</span>-EfficientRep, which utilizes a strategically designed network unit-<span>(eta)</span>-RepConv and <span>(eta)</span>-RepC2f module, to reparameterize and subsequently generate an efficient inference model. This model achieves superior performance by extracting detailed feature maps from images. Subsequently, we propose the enhanced <span>(eta)</span>-RepPANet and <span>(eta)</span>-RepAFPN as the model’s detection neck, with the addition of the <span>(eta)</span>-RepC2f for optimized feature fusion, thus boosting the neck’s functionality. Our innovation continues with the development of an advanced decoupled head for detection, where the <span>(eta)</span>-RepConv takes the place of the traditional <span>(3 times 3)</span> conv, resulting in a marked increase in detection precision during the inference stage. Our proposed <span>(eta)</span>-RepYOLO method, when applied to distinct neck modules, <span>(eta)</span>-RepPANet and <span>(eta)</span>-RepAFPN, achieves mAP of 84.77%/85.65% on the PASCAL VOC07+12 dataset and AP of 45.3%/45.8% on the MSCOCO dataset, respectively. These figures represent a significant advancement over the YOLOv8s method. Additionally, the model parameters for <span>(eta)</span>-RepYOLO are reduced to 10.8M/8.8M, which is 3.6%/21.4% less than that of YOLOv8, culminating in a more streamlined detection model. The detection speeds clocked on an RTX3060 are 116 FPS/81 FPS, showcasing a substantial enhancement in comparison to YOLOv8s. In summary, our approach delivers competitive performance and presents a more lightweight alternative to the SOTA YOLO models, making it a robust choice for real-time object detection applications.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"31 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140882430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Instance segmentation of foods is an important technology for ensuring the food success rate of meal-assisting robotics. However, foods exhibit strong intraclass variability, interclass similarity, and complex physical properties, which makes their recognition, localization, and contour acquisition more challenging. To address these issues, this paper proposes a novel method for instance segmentation of foods. Specifically, in the backbone network, deformable convolution was introduced to enhance the ability of the YOLOv8 architecture to capture finer-grained spatial information, and efficient multiscale attention (EMA) based on cross-spatial learning was introduced to improve the sensitivity and expressiveness of multiscale inputs. In the neck network, the classical convolution and C2f modules were replaced by the lightweight GSConv convolution and an improved VoV-GSCSP aggregation module, respectively, to improve the inference speed of the model. We abbreviate the result as the DEG-YOLOv8n-seg model. The proposed method was compared with the baseline model and several state-of-the-art (SOTA) segmentation models on the datasets. The results show that the DEG-YOLOv8n-seg model has higher accuracy, faster speed, and stronger robustness. Specifically, the DEG-YOLOv8n-seg model achieves 84.6% Box_mAP@0.5 and 84.1% Mask_mAP@0.5 at 55.2 FPS and 11.1 GFLOPs. The importance of adopting data augmentation and the effectiveness of introducing deformable convolution, EMA, and VoV-GSCSP were verified by ablation experiments. Finally, the DEG-YOLOv8n-seg model was applied to food instance segmentation experiments for meal-assisting robots. The results show that DEG-YOLOv8n-seg achieves better instance segmentation of foods. This work can promote the development of intelligent meal-assisting robotics and provide a useful reference for other computer vision tasks.
{"title":"Real-time and accurate model of instance segmentation of foods","authors":"Yuhe Fan, Lixun Zhang, Canxing Zheng, Yunqin Zu, Keyi Wang, Xingyuan Wang","doi":"10.1007/s11554-024-01459-z","DOIUrl":"https://doi.org/10.1007/s11554-024-01459-z","url":null,"abstract":"<p>Instance segmentation of foods is an important technology to ensure the food success rate of meal-assisting robotics. However, due to foods have strong intraclass variability, interclass similarity, and complex physical properties, which leads to more challenges in recognition, localization, and contour acquisition of foods. To address the above issues, this paper proposed a novel method for instance segmentation of foods. Specifically, in backbone network, deformable convolution was introduced to enhance the ability of YOLOv8 architecture to capture finer-grained spatial information, and efficient multiscale attention based on cross-spatial learning was introduced to improve sensitivity and expressiveness of multiscale inputs. In neck network, classical convolution and C2f modules were replaced by lightweight convolution GSConv and improved VoV-GSCSP aggregation module, respectively, to improve inference speed of models. We abbreviated it as the DEG-YOLOv8n-seg model. The proposed method was compared with baseline model and several state-of-the-art (SOTA) segmentation models on datasets, respectively. The results show that the DEG-YOLOv8n-seg model has higher accuracy, faster speed, and stronger robustness. Specifically, the DEG-YOLOv8n-seg model can achieve 84.6% Box_mAP@0.5 and 84.1% Mask_mAP@0.5 accuracy at 55.2 FPS and 11.1 GFLOPs. The importance of adopting data augmentation and the effectiveness of introducing deformable convolution, EMA, and VoV-GSCSP were verified by ablation experiments. Finally, the DEG-YOLOv8n-seg model was applied to experiments of food instance segmentation for meal-assisting robots. The results show that the DEG-YOLOv8n-seg can achieve better instance segmentation of foods. This work can promote the development of intelligent meal-assisting robotics technology and can provide theoretical foundations for other tasks of the computer vision field with some reference value.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"21 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140829793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-21. DOI: 10.1007/s11554-024-01456-2
Xucheng Wang, Dan Zeng, Yongxin Li, Mingliang Zou, Qijun Zhao, Shuiwang Li
Addressing the core challenge of achieving both high efficiency and precision in UAV tracking is crucial due to limitations in computing resources, battery capacity, and maximum load capacity on UAVs. Discriminative correlation filter (DCF)-based trackers excel in efficiency on a single CPU but lag in precision. In contrast, many lightweight deep learning (DL)-based trackers built on model compression strike a better balance between efficiency and precision. However, higher compression rates can hinder performance by diminishing discriminative representations. Given these challenges, our paper aims to enhance the discriminative ability of feature representations through an innovative feature-learning approach. We specifically emphasize leveraging contrasting instances to achieve more distinct representations for effective UAV tracking. Our method eliminates the need for manual annotations and facilitates the creation and deployment of lightweight models. To the best of our knowledge, this is the first work to explore contrastive learning for UAV tracking. Through extensive experimentation across four UAV benchmarks, namely UAVDT, DTB70, UAV123@10fps and VisDrone2018, we show that our DRCI (discriminative representation with contrastive instances) tracker outperforms current state-of-the-art UAV tracking methods, underscoring its potential to effectively tackle the persistent challenges in this field.
{"title":"Enhancing UAV tracking: a focus on discriminative representations using contrastive instances","authors":"Xucheng Wang, Dan Zeng, Yongxin Li, Mingliang Zou, Qijun Zhao, Shuiwang Li","doi":"10.1007/s11554-024-01456-2","DOIUrl":"https://doi.org/10.1007/s11554-024-01456-2","url":null,"abstract":"<p>Addressing the core challenges of achieving both high efficiency and precision in UAV tracking is crucial due to limitations in computing resources, battery capacity, and maximum load capacity on UAVs. Discriminative correlation filter (DCF)-based trackers excel in efficiency on a single CPU but lag in precision. In contrast, many lightweight deep learning (DL)-based trackers based on model compression strike a better balance between efficiency and precision. However, higher compression rates can hinder performance by diminishing discriminative representations. Given these challenges, our paper aims to enhance feature representations’ discriminative abilities through an innovative feature-learning approach. We specifically emphasize leveraging contrasting instances to achieve more distinct representations for effective UAV tracking. Our method eliminates the need for manual annotations and facilitates the creation and deployment of lightweight models. As far as our knowledge goes, we are the pioneers in exploring the possibilities of contrastive learning in UAV tracking applications. Through extensive experimentation across four UAV benchmarks, namely, UAVDT, DTB70, UAV123@10fps and VisDrone2018, We have shown that our DRCI (discriminative representation with contrastive instances) tracker outperforms current state-of-the-art UAV tracking methods, underscoring its potential to effectively tackle the persistent challenges in this field.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"56 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140637100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Road crack detection plays a vital role in preserving the life of roads and ensuring driver safety. Traditional methods relying on manual observation have limitations in terms of subjectivity and inefficiency in quantifying damage. In recent years, advances in deep learning techniques have held promise for automated crack detection, but challenges, such as low contrast, small datasets, and inaccurate localization, remain. In this paper, we propose a deep learning-based pixel-level road crack segmentation network that achieves excellent performance on multiple datasets. In order to enrich the receptive fields of conventional convolutional modules, we design a residual asymmetric convolutional module for feature extraction. In addition to this, a multiple receptive field cascade module and a feature fusion module with non-local attention are proposed. Our network demonstrates superior accuracy and inference speed, achieving 55.60%, 59.01%, 75.65%, and 57.95% IoU on the CrackForest, CrackTree, CDD, and Crack500 datasets, respectively. It also has the ability to process 143 images per second. Experimental results and analysis validate the effectiveness of our approach. This work contributes to the advancement of road crack detection, providing a valuable tool for road maintenance and safety improvement.
{"title":"A novel real-time pixel-level road crack segmentation network","authors":"Rongdi Wang, Hao Wang, Zhenhao He, Jianchao Zhu, Haiqiang Zuo","doi":"10.1007/s11554-024-01458-0","DOIUrl":"https://doi.org/10.1007/s11554-024-01458-0","url":null,"abstract":"<p>Road crack detection plays a vital role in preserving the life of roads and ensuring driver safety. Traditional methods relying on manual observation have limitations in terms of subjectivity and inefficiency in quantifying damage. In recent years, advances in deep learning techniques have held promise for automated crack detection, but challenges, such as low contrast, small datasets, and inaccurate localization, remain. In this paper, we propose a deep learning-based pixel-level road crack segmentation network that achieves excellent performance on multiple datasets. In order to enrich the receptive fields of conventional convolutional modules, we design a residual asymmetric convolutional module for feature extraction. In addition to this, a multiple receptive field cascade module and a feature fusion module with non-local attention are proposed. Our network demonstrates superior accuracy and inference speed, achieving 55.60%, 59.01%, 75.65%, and 57.95% IoU on the CrackForest, CrackTree, CDD, and Crack500 datasets, respectively. It also has the ability to process 143 images per second. Experimental results and analysis validate the effectiveness of our approach. This work contributes to the advancement of road crack detection, providing a valuable tool for road maintenance and safety improvement.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"8 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140629057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-20. DOI: 10.1007/s11554-024-01457-1
Johan Lela Andika, Anis Salwa Mohd Khairuddin, Harikrishnan Ramiah, Jeevan Kanesan
The advancement of unmanned aerial vehicles (UAVs) has drawn researchers to update object detection algorithms for better accuracy and computational performance. Previous works applying deep learning models to object detection required high graphics processing unit (GPU) computation power. Generally, object detection models suffer a trade-off between accuracy and model size, and the relationship is not always linear in deep learning models. Various factors such as architectural design, optimization techniques, and dataset characteristics can significantly influence the accuracy, model size, and computation cost when adopting object detection models for low-cost embedded devices. Hence, it is crucial to employ lightweight object detection models for real-time object identification if the solution is to be sustainable. In this work, an improved feature extraction network is proposed by incorporating an efficient long-range aggregation network for vehicle detection (ELAN-VD) in the backbone layer. The architectural improvement to the YOLOv7-tiny model is proposed to improve the accuracy of detecting small vehicles in aerial images. Besides that, the output image size of the second and third prediction boxes is upscaled for better performance. This study showed that the proposed method yields a mean average precision (mAP) of 57.94%, which is higher than that of the conventional YOLOv7-tiny. In addition, the proposed model showed significant performance when compared to previous works, making it viable for application in low-cost embedded devices.
{"title":"Improved feature extraction network in lightweight YOLOv7 model for real-time vehicle detection on low-cost hardware","authors":"Johan Lela Andika, Anis Salwa Mohd Khairuddin, Harikrishnan Ramiah, Jeevan Kanesan","doi":"10.1007/s11554-024-01457-1","DOIUrl":"https://doi.org/10.1007/s11554-024-01457-1","url":null,"abstract":"<p>The advancement of unmanned aerial vehicles (UAVs) has drawn researchers to update object detection algorithms for better accuracy and computation performance. Previous works applying deep learning models for object detection applications required high graphics processing unit (GPU) computation power. Generally, object detection models suffer trade-off between accuracy and model size where the relationship is not always linear in deep learning models. Various factors such as architectural design, optimization techniques, and dataset characteristics can significantly influence the accuracy, model size, and computation cost in adopting object detection models for low-cost embedded devices. Hence, it is crucial to employ lightweight object detection models for real-time object identification for the solution to be sustainable. In this work, an improved feature extraction network is proposed by incorporating an efficient long-range aggregation network for vehicle detection (ELAN-VD) in the backbone layer. The architecture improvement in YOLOv7-tiny model is proposed to improve the accuracy of detecting small vehicles in the aerial image. Besides that, the image size output of the second and third prediction boxes is upscaled for better performance. This study showed that the proposed method yields a mean average precision (mAP) of 57.94%, which is higher than that of the conventional YOLOv7-tiny. In addition, the proposed model showed significant performance when compared to previous works, making it viable for application in low-cost embedded devices.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"1 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140630353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fatigue driving is one of the main threats to road traffic safety. To address the complex detection process, low accuracy, and susceptibility to light interference in current driver fatigue detection algorithms, this paper proposes a driver eye state detection algorithm based on YOLO, abbreviated as ES-YOLO. The algorithm optimizes the structure of YOLOv7, integrates multi-scale features using the convolutional block attention module (CBAM), and improves attention to important spatial locations in the image. Furthermore, the Focal-EIOU loss is used instead of the CIOU loss to increase attention on difficult samples and reduce the influence of sample class imbalance. Then, based on ES-YOLO, a driver fatigue detection method is proposed, and driver fatigue judgment logic is designed to monitor the fatigue state in real time and raise timely alarms, improving the accuracy of detection. Experiments on the public CEW dataset and a self-made dataset show that the proposed ES-YOLO obtains 99.0% and 98.8% mAP, respectively, which is better than the compared algorithms, and the method achieves real-time and accurate detection of the driver's fatigue status. Source code is released at https://www.github/driver-fatigue-detection.git.
{"title":"Driver fatigue detection based on improved YOLOv7","authors":"Xianguo Li, Xueyan Li, Zhenqian Shen, Guangmin Qian","doi":"10.1007/s11554-024-01455-3","DOIUrl":"https://doi.org/10.1007/s11554-024-01455-3","url":null,"abstract":"<p>Fatigue driving is one of the main reasons threatening road traffic safety. Aiming at the problems of complex detection process, low accuracy, and susceptibility to light interference in the current driver fatigue detection algorithm, this paper proposes a driver Eye State detection algorithm based on YOLO, abbreviated as ES-YOLO. The algorithm optimizes the structure of YOLOv7, integrates the multi-scale features using the convolutional block attention mechanism (CBAM), and improves the attention to important spatial locations in the image. Furthermore, using the Focal-EIOU Loss instead of CIOU Loss to increase the attention on difficult samples and reduce the influence of sample class imbalance. Then, based on ES-YOLO, a driver fatigue detection method is proposed, and the driver fatigue judgment logic is designed to monitor the fatigue state in real-time and alarm in time to improve the accuracy of detection. The experiments on the public dataset CEW and the self-made dataset show that the proposed ES-YOLO obtained 99.0% and 98.8% mAP values, respectively, which are better than the compared algorithms. And this method achieves real-time and accurate detection of driver fatigue status. Source code is released in https://www.github/driver-fatigue-detection.git.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"301 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-10. DOI: 10.1007/s11554-024-01453-5
Lijun Wu, Shangdong Qiu, Zhicong Chen
To address the incomplete segmentation of large objects and the missed segmentation of tiny objects that are common in semantic segmentation algorithms, we propose PACAMNet, a real-time segmentation network based on a short-term dense concatenation of parallel atrous convolutions and the fusion of attentional features. First, parallel atrous convolution is introduced to improve the short-term dense concatenate module. By adjusting the size of the atrous factor, multi-scale semantic information is obtained to ensure that the last layer of the module also receives rich input feature maps. Second, an attention feature fusion module is proposed to align the receptive fields of deep and shallow feature maps via depthwise-separable convolutions of different sizes, and the channel attention mechanism is used to generate weights that effectively fuse the deep and shallow feature maps. Finally, experiments are carried out on both the Cityscapes and CamVid datasets, and the segmentation accuracy reaches 77.4% and 74.0% at inference speeds of 98.7 FPS and 134.6 FPS, respectively. Compared with other methods, PACAMNet improves the inference speed of the model while ensuring higher segmentation accuracy, thus achieving a better balance between segmentation accuracy and inference speed.
{"title":"Real-time semantic segmentation network based on parallel atrous convolution for short-term dense concatenate and attention feature fusion","authors":"Lijun Wu, Shangdong Qiu, Zhicong Chen","doi":"10.1007/s11554-024-01453-5","DOIUrl":"https://doi.org/10.1007/s11554-024-01453-5","url":null,"abstract":"<p>To address the problem of incomplete segmentation of large objects and miss-segmentation of tiny objects that is universally existing in semantic segmentation algorithms, PACAMNet, a real-time segmentation network based on short-term dense concatenate of parallel atrous convolution and fusion of attentional features is proposed, called PACAMNet. First, parallel atrous convolution is introduced to improve the short-term dense concatenate module. By adjusting the size of the atrous factor, multi-scale semantic information is obtained to ensure that the last layer of the module can also obtain rich input feature maps. Second, attention feature fusion module is proposed to align the receptive fields of deep and shallow feature maps via depth-separable convolutions with different sizes, and the channel attention mechanism is used to generate weights to effectively fuse the deep and shallow feature maps. Finally, experiments are carried out based on both Cityscapes and CamVid datasets, and the segmentation accuracy achieve 77.4% and 74.0% at the inference speeds of 98.7 FPS and 134.6 FPS, respectively. Compared with other methods, PACAMNet improves the inference speed of the model while ensuring higher segmentation accuracy, so PACAMNet achieve a better balance between segmentation accuracy and inference speed.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"105 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-10. DOI: 10.1007/s11554-024-01454-4
Yi Liu, Yu Qiao, Yali Wang
Online action detection (OAD) aims at predicting the action in each frame of a streaming, untrimmed video in real time. Most existing approaches leverage all the historical frames in the sliding window as the temporal context of the current frame, since single-frame prediction is often unreliable. However, such a manner inevitably introduces useless and even noisy video content, which often misleads the action classifier when recognizing the ongoing action in the current frame. To alleviate this difficulty, we propose a concise and novel F2S-Net, which can adaptively discover the contextual segments in the online sliding window and convert current-frame prediction into relevant-segment prediction. More specifically, as the current frame can be either action or background, we develop F2S-Net with a distinct two-branch structure, i.e., the action (or background) branch can exploit the action (or background) segments. Via multi-level action supervision, these two branches complementarily enhance each other, allowing the model to identify the contextual segments in the sliding window and robustly predict what is ongoing. We evaluate our approach on popular OAD benchmarks, i.e., THUMOS-14, TVSeries and HDD. The extensive results show that our F2S-Net outperforms recent state-of-the-art approaches.
{"title":"F2S-Net: learning frame-to-segment prediction for online action detection","authors":"Yi Liu, Yu Qiao, Yali Wang","doi":"10.1007/s11554-024-01454-4","DOIUrl":"https://doi.org/10.1007/s11554-024-01454-4","url":null,"abstract":"<p>Online action detection (OAD) aims at predicting action per frame from a streaming untrimmed video in real time. Most existing approaches leverage all the historical frames in the sliding window as the temporal context of the current frame since single-frame prediction is often unreliable. However, such a manner inevitably introduces useless even noisy video content, which often misleads action classifier when recognizing the ongoing action in the current frame. To alleviate this difficulty, we propose a concise and novel F2S-Net, which can adaptively discover the contextual segments in the online sliding window, and convert current frame prediction into relevant-segment prediction. More specifically, as the current frame can be either action or background, we develop F2S-Net with a distinct two-branch structure, i.e., the action (or background) branch can exploit the action (or background) segments. Via multi-level action supervision, these two branches can complementarily enhance each other, allowing to identify the contextual segments in the sliding window to robustly predict what is ongoing. We evaluate our approach on popular OAD benchmarks, i.e., THUMOS-14, TVSeries and HDD. The extensive results show that our F2S-Net outperforms the recent state-of-the-art approaches.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"22 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-09. DOI: 10.1007/s11554-024-01437-5
Gang Dong, Yefei Zhang, Weicheng Xie, Yong Huang
In current safety helmet detection methods, the feature information of small-scale safety helmets is lost after the network has applied many convolutions, resulting in missed detections. To this end, an improved YOLOv5 target detection algorithm is used to detect the wearing of safety helmets. Firstly, a new small-scale detection layer is added to the head of the network for multi-scale feature fusion, thereby increasing the receptive field area of the feature map and improving the model's recognition of small targets. Secondly, a cross-layer connection is designed between the feature extraction network and the feature fusion network to enhance the fine-grained features of the target in the shallow layers of the network. Thirdly, a coordinate attention (CA) module is added to the cross-layer connection to capture the global information of the image and improve the localization ability for the target. Finally, the Normalized Wasserstein Distance (NWD) is used to measure the similarity between bounding boxes, replacing the intersection over union (IoU) method. The experimental results show that the improved model achieves an mAP of 95.09% for safety helmet-wearing detection and performs well in recognizing small-sized safety helmets of varying scales in construction scenes.
{"title":"A safety helmet-wearing detection method based on cross-layer connection","authors":"Gang Dong, Yefei Zhang, Weicheng Xie, Yong Huang","doi":"10.1007/s11554-024-01437-5","DOIUrl":"https://doi.org/10.1007/s11554-024-01437-5","url":null,"abstract":"<p>Given the current safety helmet detection methods, the feature information of the small-scale safety helmet will be lost after the network model is convolved many times, resulting in the problem of missing detection of the safety helmet. To this end, an improved target detection algorithm of YOLOv5 is used to detect the wearing of safety helmets. Firstly, a new small-scale detection layer is added to the head of the network for multi-scale feature fusion, thereby increasing the receptive field area of the feature map to improve the model’s recognition of small targets. Secondly, a cross-layer connection is designed between the feature extraction network and the feature fusion network to enhance the fine-grained features of the target in the shallow layer of the network. Thirdly, a coordinate attention (CA) module is added to the cross-layer connection to capture the global information of the image and improve the localization ability of the target. Finally, the Normalized Wasserstein Distance (NWD) is used to measure the similarity between bounding boxes, replacing the intersection over union (IoU) method. The experimental results show that the improved model achieves 95.09% of the mAP value for safety helmet-wearing detection, which has a good effect on the recognition of small-sized safety helmets of different degrees in the construction work scene.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"19 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-05. DOI: 10.1007/s11554-024-01434-8
Kelun Tang, Lin Lang, Xiaojun Zhou
The parametric active contour model is an efficient approach to image segmentation. However, the high cost of the evolution computation has restricted its application to contour segmentation with long perimeters. Extensive algorithm debugging and analysis indicate that the inverse-matrix calculation and the matrix multiplication are the two major bottlenecks. In this paper, a novel, simple, and efficient algorithm for the evolution computation is proposed. Motivated by the relationship between the eigenvalues and the entries of the circular Toeplitz matrix, each entry of the inverse matrix is first derived through mathematical deduction, and the matrix multiplication is then simplified into a more efficient convolution operation. Experimental results show that the proposed algorithm can significantly improve the computational speed by one to two orders of magnitude and is even more efficient for contour extraction with large perimeters.
{"title":"Equivalent convolution strategy for the evolution computation in parametric active contour model","authors":"Kelun Tang, Lin Lang, Xiaojun Zhou","doi":"10.1007/s11554-024-01434-8","DOIUrl":"https://doi.org/10.1007/s11554-024-01434-8","url":null,"abstract":"<p>Parametric active contour model is an efficient approach for image segmentation. However, the high cost of evolution computation has restricted their potential applications to contour segmentation with long perimeter. Extensive algorithm debugging and analysis indicate that the inverse matrix calculation and the matrix multiplication are the two major reasons. In this paper, a novel simple and efficient algorithm for evolution computation is proposed. Motivated by the relationship between the eigenvalues and the entries in the circular Toeplitz matrix, each entry expression of inverse matrix is firstly derived through mathematical deduction, and then, the matrix multiplication is simplified into a more efficient convolution operation. Experimental results show that the proposed algorithm can significantly improve the computational speed by one to two orders of magnitude and is even more efficient for contour extraction with large perimeter.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"85 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}