Pub Date: 2024-05-03. DOI: 10.1007/s11554-024-01462-4
Shuai Feng, Huaming Qian, Huilin Wang, Wenna Wang
Deep learning-based object detection methods often grapple with excessive model parameters, high complexity, and subpar real-time performance. In response, the YOLO series, particularly the YOLOv5s to YOLOv8s methods, has been developed to strike a balance between real-time processing and accuracy. Nevertheless, YOLOv8's precision can fall short in certain applications. To address this, we introduce a real-time object detection method called η-RepYOLO, which is built upon the η-RepConv structure. This method is designed to maintain consistent detection speeds while improving accuracy. We begin by crafting a backbone network named η-EfficientRep, which utilizes strategically designed network units, the η-RepConv and η-RepC2f modules, that are reparameterized to generate an efficient inference model. This model achieves superior performance by extracting detailed feature maps from images. Subsequently, we propose the enhanced η-RepPANet and η-RepAFPN as the model's detection neck, with the addition of the η-RepC2f for optimized feature fusion, thus boosting the neck's functionality. Our innovation continues with the development of an advanced decoupled detection head, where the η-RepConv takes the place of the traditional 3 × 3 conv, resulting in a marked increase in detection precision during the inference stage. Our proposed η-RepYOLO method, when applied with the distinct neck modules η-RepPANet and η-RepAFPN, achieves an mAP of 84.77%/85.65% on the PASCAL VOC07+12 dataset and an AP of 45.3%/45.8% on the MSCOCO dataset, respectively. These figures represent a significant advancement over the YOLOv8s method. Additionally, the model parameters of η-RepYOLO are reduced to 10.8M/8.8M, which is 3.6%/21.4% less than that of YOLOv8, culminating in a more streamlined detection model. The detection speeds clocked on an RTX 3060 are 116 FPS/81 FPS, a substantial enhancement compared to YOLOv8s. In summary, our approach delivers competitive performance and presents a more lightweight alternative to the SOTA YOLO models, making it a robust choice for real-time object detection applications.
{"title":"$$eta$$ -repyolo: real-time object detection method based on $$eta$$ -RepConv and YOLOv8","authors":"Shuai Feng, Huaming Qian, Huilin Wang, Wenna Wang","doi":"10.1007/s11554-024-01462-4","DOIUrl":"https://doi.org/10.1007/s11554-024-01462-4","url":null,"abstract":"<p>Deep learning-based object detection methods often grapple with excessive model parameters, high complexity, and subpar real-time performance. In response, the YOLO series, particularly the YOLOv5s to YOLOv8s methods, has been developed by scholars to strike a balance between real-time processing and accuracy. Nevertheless, YOLOv8’s precision can fall short in certain specific applications. To address this, we introduce a real-time object detection method called <span>(eta)</span>-RepYOLO, which is built upon the <span>(eta)</span>-RepConv structure. This method is designed to maintain consistent detection speeds while improving accuracy. We begin by crafting a backbone network named <span>(eta)</span>-EfficientRep, which utilizes a strategically designed network unit-<span>(eta)</span>-RepConv and <span>(eta)</span>-RepC2f module, to reparameterize and subsequently generate an efficient inference model. This model achieves superior performance by extracting detailed feature maps from images. Subsequently, we propose the enhanced <span>(eta)</span>-RepPANet and <span>(eta)</span>-RepAFPN as the model’s detection neck, with the addition of the <span>(eta)</span>-RepC2f for optimized feature fusion, thus boosting the neck’s functionality. Our innovation continues with the development of an advanced decoupled head for detection, where the <span>(eta)</span>-RepConv takes the place of the traditional <span>(3 times 3)</span> conv, resulting in a marked increase in detection precision during the inference stage. Our proposed <span>(eta)</span>-RepYOLO method, when applied to distinct neck modules, <span>(eta)</span>-RepPANet and <span>(eta)</span>-RepAFPN, achieves mAP of 84.77%/85.65% on the PASCAL VOC07+12 dataset and AP of 45.3%/45.8% on the MSCOCO dataset, respectively. These figures represent a significant advancement over the YOLOv8s method. Additionally, the model parameters for <span>(eta)</span>-RepYOLO are reduced to 10.8M/8.8M, which is 3.6%/21.4% less than that of YOLOv8, culminating in a more streamlined detection model. The detection speeds clocked on an RTX3060 are 116 FPS/81 FPS, showcasing a substantial enhancement in comparison to YOLOv8s. In summary, our approach delivers competitive performance and presents a more lightweight alternative to the SOTA YOLO models, making it a robust choice for real-time object detection applications.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"31 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140882430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Instance segmentation of foods is an important technology for ensuring the food success rate of meal-assisting robotics. However, foods exhibit strong intraclass variability, interclass similarity, and complex physical properties, which makes their recognition, localization, and contour acquisition more challenging. To address these issues, this paper proposes a novel method for instance segmentation of foods. Specifically, in the backbone network, deformable convolution was introduced to enhance the ability of the YOLOv8 architecture to capture finer-grained spatial information, and efficient multiscale attention (EMA) based on cross-spatial learning was introduced to improve the sensitivity and expressiveness of multiscale inputs. In the neck network, the classical convolution and C2f modules were replaced by the lightweight GSConv convolution and an improved VoV-GSCSP aggregation module, respectively, to improve the inference speed of the model. We abbreviate the result as the DEG-YOLOv8n-seg model. The proposed method was compared with the baseline model and several state-of-the-art (SOTA) segmentation models on the datasets. The results show that the DEG-YOLOv8n-seg model has higher accuracy, faster speed, and stronger robustness. Specifically, the DEG-YOLOv8n-seg model achieves 84.6% Box_mAP@0.5 and 84.1% Mask_mAP@0.5 at 55.2 FPS and 11.1 GFLOPs. The importance of adopting data augmentation and the effectiveness of introducing deformable convolution, EMA, and VoV-GSCSP were verified by ablation experiments. Finally, the DEG-YOLOv8n-seg model was applied to food instance segmentation experiments for meal-assisting robots. The results show that DEG-YOLOv8n-seg achieves better instance segmentation of foods. This work can promote the development of intelligent meal-assisting robotics and provide a useful reference for other computer vision tasks.
{"title":"Real-time and accurate model of instance segmentation of foods","authors":"Yuhe Fan, Lixun Zhang, Canxing Zheng, Yunqin Zu, Keyi Wang, Xingyuan Wang","doi":"10.1007/s11554-024-01459-z","DOIUrl":"https://doi.org/10.1007/s11554-024-01459-z","url":null,"abstract":"<p>Instance segmentation of foods is an important technology to ensure the food success rate of meal-assisting robotics. However, due to foods have strong intraclass variability, interclass similarity, and complex physical properties, which leads to more challenges in recognition, localization, and contour acquisition of foods. To address the above issues, this paper proposed a novel method for instance segmentation of foods. Specifically, in backbone network, deformable convolution was introduced to enhance the ability of YOLOv8 architecture to capture finer-grained spatial information, and efficient multiscale attention based on cross-spatial learning was introduced to improve sensitivity and expressiveness of multiscale inputs. In neck network, classical convolution and C2f modules were replaced by lightweight convolution GSConv and improved VoV-GSCSP aggregation module, respectively, to improve inference speed of models. We abbreviated it as the DEG-YOLOv8n-seg model. The proposed method was compared with baseline model and several state-of-the-art (SOTA) segmentation models on datasets, respectively. The results show that the DEG-YOLOv8n-seg model has higher accuracy, faster speed, and stronger robustness. Specifically, the DEG-YOLOv8n-seg model can achieve 84.6% Box_mAP@0.5 and 84.1% Mask_mAP@0.5 accuracy at 55.2 FPS and 11.1 GFLOPs. The importance of adopting data augmentation and the effectiveness of introducing deformable convolution, EMA, and VoV-GSCSP were verified by ablation experiments. Finally, the DEG-YOLOv8n-seg model was applied to experiments of food instance segmentation for meal-assisting robots. The results show that the DEG-YOLOv8n-seg can achieve better instance segmentation of foods. This work can promote the development of intelligent meal-assisting robotics technology and can provide theoretical foundations for other tasks of the computer vision field with some reference value.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"21 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140829793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-21. DOI: 10.1007/s11554-024-01456-2
Xucheng Wang, Dan Zeng, Yongxin Li, Mingliang Zou, Qijun Zhao, Shuiwang Li
Addressing the core challenge of achieving both high efficiency and precision in UAV tracking is crucial due to limitations in computing resources, battery capacity, and maximum load capacity on UAVs. Discriminative correlation filter (DCF)-based trackers excel in efficiency on a single CPU but lag in precision. In contrast, many lightweight deep learning (DL)-based trackers built on model compression strike a better balance between efficiency and precision. However, higher compression rates can hinder performance by diminishing discriminative representations. Given these challenges, our paper aims to enhance the discriminative ability of feature representations through an innovative feature-learning approach. We specifically emphasize leveraging contrasting instances to achieve more distinct representations for effective UAV tracking. Our method eliminates the need for manual annotations and facilitates the creation and deployment of lightweight models. To the best of our knowledge, this is the first work to explore contrastive learning for UAV tracking. Through extensive experimentation across four UAV benchmarks, namely UAVDT, DTB70, UAV123@10fps and VisDrone2018, we show that our DRCI (discriminative representation with contrastive instances) tracker outperforms current state-of-the-art UAV tracking methods, underscoring its potential to effectively tackle the persistent challenges in this field.
{"title":"Enhancing UAV tracking: a focus on discriminative representations using contrastive instances","authors":"Xucheng Wang, Dan Zeng, Yongxin Li, Mingliang Zou, Qijun Zhao, Shuiwang Li","doi":"10.1007/s11554-024-01456-2","DOIUrl":"https://doi.org/10.1007/s11554-024-01456-2","url":null,"abstract":"<p>Addressing the core challenges of achieving both high efficiency and precision in UAV tracking is crucial due to limitations in computing resources, battery capacity, and maximum load capacity on UAVs. Discriminative correlation filter (DCF)-based trackers excel in efficiency on a single CPU but lag in precision. In contrast, many lightweight deep learning (DL)-based trackers based on model compression strike a better balance between efficiency and precision. However, higher compression rates can hinder performance by diminishing discriminative representations. Given these challenges, our paper aims to enhance feature representations’ discriminative abilities through an innovative feature-learning approach. We specifically emphasize leveraging contrasting instances to achieve more distinct representations for effective UAV tracking. Our method eliminates the need for manual annotations and facilitates the creation and deployment of lightweight models. As far as our knowledge goes, we are the pioneers in exploring the possibilities of contrastive learning in UAV tracking applications. Through extensive experimentation across four UAV benchmarks, namely, UAVDT, DTB70, UAV123@10fps and VisDrone2018, We have shown that our DRCI (discriminative representation with contrastive instances) tracker outperforms current state-of-the-art UAV tracking methods, underscoring its potential to effectively tackle the persistent challenges in this field.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"56 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140637100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Road crack detection plays a vital role in preserving the life of roads and ensuring driver safety. Traditional methods relying on manual observation have limitations in terms of subjectivity and inefficiency in quantifying damage. In recent years, advances in deep learning techniques have held promise for automated crack detection, but challenges, such as low contrast, small datasets, and inaccurate localization, remain. In this paper, we propose a deep learning-based pixel-level road crack segmentation network that achieves excellent performance on multiple datasets. In order to enrich the receptive fields of conventional convolutional modules, we design a residual asymmetric convolutional module for feature extraction. In addition to this, a multiple receptive field cascade module and a feature fusion module with non-local attention are proposed. Our network demonstrates superior accuracy and inference speed, achieving 55.60%, 59.01%, 75.65%, and 57.95% IoU on the CrackForest, CrackTree, CDD, and Crack500 datasets, respectively. It also has the ability to process 143 images per second. Experimental results and analysis validate the effectiveness of our approach. This work contributes to the advancement of road crack detection, providing a valuable tool for road maintenance and safety improvement.
{"title":"A novel real-time pixel-level road crack segmentation network","authors":"Rongdi Wang, Hao Wang, Zhenhao He, Jianchao Zhu, Haiqiang Zuo","doi":"10.1007/s11554-024-01458-0","DOIUrl":"https://doi.org/10.1007/s11554-024-01458-0","url":null,"abstract":"<p>Road crack detection plays a vital role in preserving the life of roads and ensuring driver safety. Traditional methods relying on manual observation have limitations in terms of subjectivity and inefficiency in quantifying damage. In recent years, advances in deep learning techniques have held promise for automated crack detection, but challenges, such as low contrast, small datasets, and inaccurate localization, remain. In this paper, we propose a deep learning-based pixel-level road crack segmentation network that achieves excellent performance on multiple datasets. In order to enrich the receptive fields of conventional convolutional modules, we design a residual asymmetric convolutional module for feature extraction. In addition to this, a multiple receptive field cascade module and a feature fusion module with non-local attention are proposed. Our network demonstrates superior accuracy and inference speed, achieving 55.60%, 59.01%, 75.65%, and 57.95% IoU on the CrackForest, CrackTree, CDD, and Crack500 datasets, respectively. It also has the ability to process 143 images per second. Experimental results and analysis validate the effectiveness of our approach. This work contributes to the advancement of road crack detection, providing a valuable tool for road maintenance and safety improvement.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"8 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140629057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-20. DOI: 10.1007/s11554-024-01457-1
Johan Lela Andika, Anis Salwa Mohd Khairuddin, Harikrishnan Ramiah, Jeevan Kanesan
The advancement of unmanned aerial vehicles (UAVs) has drawn researchers to update object detection algorithms for better accuracy and computational performance. Previous works applying deep learning models to object detection required high graphics processing unit (GPU) computation power. Generally, object detection models suffer a trade-off between accuracy and model size, and the relationship is not always linear in deep learning models. Various factors such as architectural design, optimization techniques, and dataset characteristics can significantly influence the accuracy, model size, and computation cost when adopting object detection models for low-cost embedded devices. Hence, it is crucial to employ lightweight object detection models for real-time object identification if the solution is to be sustainable. In this work, an improved feature extraction network is proposed by incorporating an efficient long-range aggregation network for vehicle detection (ELAN-VD) in the backbone layer. The architectural improvement to the YOLOv7-tiny model is proposed to improve the accuracy of detecting small vehicles in aerial images. Besides that, the output image size of the second and third prediction boxes is upscaled for better performance. This study showed that the proposed method yields a mean average precision (mAP) of 57.94%, which is higher than that of the conventional YOLOv7-tiny. In addition, the proposed model showed significant performance when compared to previous works, making it viable for application in low-cost embedded devices.
{"title":"Improved feature extraction network in lightweight YOLOv7 model for real-time vehicle detection on low-cost hardware","authors":"Johan Lela Andika, Anis Salwa Mohd Khairuddin, Harikrishnan Ramiah, Jeevan Kanesan","doi":"10.1007/s11554-024-01457-1","DOIUrl":"https://doi.org/10.1007/s11554-024-01457-1","url":null,"abstract":"<p>The advancement of unmanned aerial vehicles (UAVs) has drawn researchers to update object detection algorithms for better accuracy and computation performance. Previous works applying deep learning models for object detection applications required high graphics processing unit (GPU) computation power. Generally, object detection models suffer trade-off between accuracy and model size where the relationship is not always linear in deep learning models. Various factors such as architectural design, optimization techniques, and dataset characteristics can significantly influence the accuracy, model size, and computation cost in adopting object detection models for low-cost embedded devices. Hence, it is crucial to employ lightweight object detection models for real-time object identification for the solution to be sustainable. In this work, an improved feature extraction network is proposed by incorporating an efficient long-range aggregation network for vehicle detection (ELAN-VD) in the backbone layer. The architecture improvement in YOLOv7-tiny model is proposed to improve the accuracy of detecting small vehicles in the aerial image. Besides that, the image size output of the second and third prediction boxes is upscaled for better performance. This study showed that the proposed method yields a mean average precision (mAP) of 57.94%, which is higher than that of the conventional YOLOv7-tiny. In addition, the proposed model showed significant performance when compared to previous works, making it viable for application in low-cost embedded devices.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"1 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140630353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fatigue driving is one of the main threats to road traffic safety. To address the complex detection process, low accuracy, and susceptibility to light interference in current driver fatigue detection algorithms, this paper proposes a driver eye state detection algorithm based on YOLO, abbreviated as ES-YOLO. The algorithm optimizes the structure of YOLOv7, integrates multi-scale features using the convolutional block attention module (CBAM), and improves attention to important spatial locations in the image. Furthermore, the Focal-EIOU loss is used instead of the CIOU loss to increase attention on difficult samples and reduce the influence of sample class imbalance. Then, based on ES-YOLO, a driver fatigue detection method is proposed, and driver fatigue judgment logic is designed to monitor the fatigue state in real time and raise timely alarms, improving the accuracy of detection. Experiments on the public CEW dataset and a self-made dataset show that the proposed ES-YOLO obtains 99.0% and 98.8% mAP, respectively, which is better than the compared algorithms, and the method achieves real-time and accurate detection of the driver's fatigue status. Source code is released at https://www.github/driver-fatigue-detection.git.
{"title":"Driver fatigue detection based on improved YOLOv7","authors":"Xianguo Li, Xueyan Li, Zhenqian Shen, Guangmin Qian","doi":"10.1007/s11554-024-01455-3","DOIUrl":"https://doi.org/10.1007/s11554-024-01455-3","url":null,"abstract":"<p>Fatigue driving is one of the main reasons threatening road traffic safety. Aiming at the problems of complex detection process, low accuracy, and susceptibility to light interference in the current driver fatigue detection algorithm, this paper proposes a driver Eye State detection algorithm based on YOLO, abbreviated as ES-YOLO. The algorithm optimizes the structure of YOLOv7, integrates the multi-scale features using the convolutional block attention mechanism (CBAM), and improves the attention to important spatial locations in the image. Furthermore, using the Focal-EIOU Loss instead of CIOU Loss to increase the attention on difficult samples and reduce the influence of sample class imbalance. Then, based on ES-YOLO, a driver fatigue detection method is proposed, and the driver fatigue judgment logic is designed to monitor the fatigue state in real-time and alarm in time to improve the accuracy of detection. The experiments on the public dataset CEW and the self-made dataset show that the proposed ES-YOLO obtained 99.0% and 98.8% mAP values, respectively, which are better than the compared algorithms. And this method achieves real-time and accurate detection of driver fatigue status. Source code is released in https://www.github/driver-fatigue-detection.git.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"301 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-10. DOI: 10.1007/s11554-024-01453-5
Lijun Wu, Shangdong Qiu, Zhicong Chen
To address the incomplete segmentation of large objects and the missed segmentation of tiny objects that are common in semantic segmentation algorithms, we propose PACAMNet, a real-time segmentation network based on a short-term dense concatenation of parallel atrous convolutions and the fusion of attentional features. First, parallel atrous convolution is introduced to improve the short-term dense concatenate module. By adjusting the size of the atrous factor, multi-scale semantic information is obtained to ensure that the last layer of the module also receives rich input feature maps. Second, an attention feature fusion module is proposed to align the receptive fields of deep and shallow feature maps via depthwise-separable convolutions of different sizes, and the channel attention mechanism is used to generate weights that effectively fuse the deep and shallow feature maps. Finally, experiments are carried out on both the Cityscapes and CamVid datasets, and the segmentation accuracy reaches 77.4% and 74.0% at inference speeds of 98.7 FPS and 134.6 FPS, respectively. Compared with other methods, PACAMNet improves the inference speed of the model while ensuring higher segmentation accuracy, thus achieving a better balance between segmentation accuracy and inference speed.
{"title":"Real-time semantic segmentation network based on parallel atrous convolution for short-term dense concatenate and attention feature fusion","authors":"Lijun Wu, Shangdong Qiu, Zhicong Chen","doi":"10.1007/s11554-024-01453-5","DOIUrl":"https://doi.org/10.1007/s11554-024-01453-5","url":null,"abstract":"<p>To address the problem of incomplete segmentation of large objects and miss-segmentation of tiny objects that is universally existing in semantic segmentation algorithms, PACAMNet, a real-time segmentation network based on short-term dense concatenate of parallel atrous convolution and fusion of attentional features is proposed, called PACAMNet. First, parallel atrous convolution is introduced to improve the short-term dense concatenate module. By adjusting the size of the atrous factor, multi-scale semantic information is obtained to ensure that the last layer of the module can also obtain rich input feature maps. Second, attention feature fusion module is proposed to align the receptive fields of deep and shallow feature maps via depth-separable convolutions with different sizes, and the channel attention mechanism is used to generate weights to effectively fuse the deep and shallow feature maps. Finally, experiments are carried out based on both Cityscapes and CamVid datasets, and the segmentation accuracy achieve 77.4% and 74.0% at the inference speeds of 98.7 FPS and 134.6 FPS, respectively. Compared with other methods, PACAMNet improves the inference speed of the model while ensuring higher segmentation accuracy, so PACAMNet achieve a better balance between segmentation accuracy and inference speed.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"105 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-10. DOI: 10.1007/s11554-024-01454-4
Yi Liu, Yu Qiao, Yali Wang
Online action detection (OAD) aims at predicting the action in each frame of a streaming, untrimmed video in real time. Most existing approaches leverage all the historical frames in the sliding window as the temporal context of the current frame, since single-frame prediction is often unreliable. However, such a manner inevitably introduces useless and even noisy video content, which often misleads the action classifier when recognizing the ongoing action in the current frame. To alleviate this difficulty, we propose a concise and novel F2S-Net, which can adaptively discover the contextual segments in the online sliding window and convert current-frame prediction into relevant-segment prediction. More specifically, as the current frame can be either action or background, we develop F2S-Net with a distinct two-branch structure, i.e., the action (or background) branch can exploit the action (or background) segments. Via multi-level action supervision, these two branches complementarily enhance each other, allowing the model to identify the contextual segments in the sliding window and robustly predict what is ongoing. We evaluate our approach on popular OAD benchmarks, i.e., THUMOS-14, TVSeries and HDD. The extensive results show that our F2S-Net outperforms recent state-of-the-art approaches.
{"title":"F2S-Net: learning frame-to-segment prediction for online action detection","authors":"Yi Liu, Yu Qiao, Yali Wang","doi":"10.1007/s11554-024-01454-4","DOIUrl":"https://doi.org/10.1007/s11554-024-01454-4","url":null,"abstract":"<p>Online action detection (OAD) aims at predicting action per frame from a streaming untrimmed video in real time. Most existing approaches leverage all the historical frames in the sliding window as the temporal context of the current frame since single-frame prediction is often unreliable. However, such a manner inevitably introduces useless even noisy video content, which often misleads action classifier when recognizing the ongoing action in the current frame. To alleviate this difficulty, we propose a concise and novel F2S-Net, which can adaptively discover the contextual segments in the online sliding window, and convert current frame prediction into relevant-segment prediction. More specifically, as the current frame can be either action or background, we develop F2S-Net with a distinct two-branch structure, i.e., the action (or background) branch can exploit the action (or background) segments. Via multi-level action supervision, these two branches can complementarily enhance each other, allowing to identify the contextual segments in the sliding window to robustly predict what is ongoing. We evaluate our approach on popular OAD benchmarks, i.e., THUMOS-14, TVSeries and HDD. The extensive results show that our F2S-Net outperforms the recent state-of-the-art approaches.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"22 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-09. DOI: 10.1007/s11554-024-01437-5
Gang Dong, Yefei Zhang, Weicheng Xie, Yong Huang
In current safety helmet detection methods, the feature information of small-scale safety helmets is lost after the network has applied many convolutions, resulting in missed detections. To this end, an improved YOLOv5 target detection algorithm is used to detect the wearing of safety helmets. Firstly, a new small-scale detection layer is added to the head of the network for multi-scale feature fusion, thereby increasing the receptive field area of the feature map and improving the model's recognition of small targets. Secondly, a cross-layer connection is designed between the feature extraction network and the feature fusion network to enhance the fine-grained features of the target in the shallow layers of the network. Thirdly, a coordinate attention (CA) module is added to the cross-layer connection to capture the global information of the image and improve the localization ability for the target. Finally, the Normalized Wasserstein Distance (NWD) is used to measure the similarity between bounding boxes, replacing the intersection over union (IoU) method. The experimental results show that the improved model achieves an mAP of 95.09% for safety helmet-wearing detection and performs well in recognizing small-sized safety helmets of varying scales in construction scenes.
{"title":"A safety helmet-wearing detection method based on cross-layer connection","authors":"Gang Dong, Yefei Zhang, Weicheng Xie, Yong Huang","doi":"10.1007/s11554-024-01437-5","DOIUrl":"https://doi.org/10.1007/s11554-024-01437-5","url":null,"abstract":"<p>Given the current safety helmet detection methods, the feature information of the small-scale safety helmet will be lost after the network model is convolved many times, resulting in the problem of missing detection of the safety helmet. To this end, an improved target detection algorithm of YOLOv5 is used to detect the wearing of safety helmets. Firstly, a new small-scale detection layer is added to the head of the network for multi-scale feature fusion, thereby increasing the receptive field area of the feature map to improve the model’s recognition of small targets. Secondly, a cross-layer connection is designed between the feature extraction network and the feature fusion network to enhance the fine-grained features of the target in the shallow layer of the network. Thirdly, a coordinate attention (CA) module is added to the cross-layer connection to capture the global information of the image and improve the localization ability of the target. Finally, the Normalized Wasserstein Distance (NWD) is used to measure the similarity between bounding boxes, replacing the intersection over union (IoU) method. The experimental results show that the improved model achieves 95.09% of the mAP value for safety helmet-wearing detection, which has a good effect on the recognition of small-sized safety helmets of different degrees in the construction work scene.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"19 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-05. DOI: 10.1007/s11554-024-01434-8
Kelun Tang, Lin Lang, Xiaojun Zhou
The parametric active contour model is an efficient approach to image segmentation. However, the high cost of the evolution computation has restricted its application to contour segmentation with long perimeters. Extensive algorithm debugging and analysis indicate that the inverse-matrix calculation and the matrix multiplication are the two major bottlenecks. In this paper, a novel, simple, and efficient algorithm for the evolution computation is proposed. Motivated by the relationship between the eigenvalues and the entries of the circular Toeplitz matrix, each entry of the inverse matrix is first derived through mathematical deduction, and the matrix multiplication is then simplified into a more efficient convolution operation. Experimental results show that the proposed algorithm can significantly improve the computational speed by one to two orders of magnitude and is even more efficient for contour extraction with large perimeters.
{"title":"Equivalent convolution strategy for the evolution computation in parametric active contour model","authors":"Kelun Tang, Lin Lang, Xiaojun Zhou","doi":"10.1007/s11554-024-01434-8","DOIUrl":"https://doi.org/10.1007/s11554-024-01434-8","url":null,"abstract":"<p>Parametric active contour model is an efficient approach for image segmentation. However, the high cost of evolution computation has restricted their potential applications to contour segmentation with long perimeter. Extensive algorithm debugging and analysis indicate that the inverse matrix calculation and the matrix multiplication are the two major reasons. In this paper, a novel simple and efficient algorithm for evolution computation is proposed. Motivated by the relationship between the eigenvalues and the entries in the circular Toeplitz matrix, each entry expression of inverse matrix is firstly derived through mathematical deduction, and then, the matrix multiplication is simplified into a more efficient convolution operation. Experimental results show that the proposed algorithm can significantly improve the computational speed by one to two orders of magnitude and is even more efficient for contour extraction with large perimeter.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"85 1","pages":""},"PeriodicalIF":3.0,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140601904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}