The preprocessing of point cloud data has always been an important problem in 3D object detection. Due to the large volume of point cloud data, voxelization methods are often used to represent the point cloud while reducing data density. However, common voxelization randomly selects sampling points from voxels, which often fails to represent local spatial features well due to noise. To preserve local features, this paper proposes an optimized voxel downsampling (OVD) method based on evidence theory. This method uses fuzzy sets to model basic probability assignments (BPAs) for each candidate point, incorporating point location information. It then employs evidence theory to fuse the BPAs and determine the selected sampling points. In the PointPillars 3D object detection algorithm, the point cloud is partitioned into pillars and encoded using each pillar's points, and convolutional neural networks are used for feature extraction and detection. Another contribution is an improved PointPillars based on evidence theory (ET-PointPillars), which introduces an OVD-based feature point sampling module into the PointPillars pillar feature network; the module selects feature points in pillars using the optimized method, computes offsets to these points, and adds them as features to facilitate learning more object characteristics, improving on traditional PointPillars. Experiments on the KITTI dataset validate the method's ability to preserve local spatial features. Results show improved detection precision, with a 2.73% average increase for pedestrians and cyclists.
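As a rough illustration of how evidence-theoretic point selection inside a voxel could look, the sketch below builds one BPA per coordinate axis from a Gaussian fuzzy membership of the point's distance to the voxel centroid and fuses the three BPAs with Dempster's rule; the two-element frame, the Gaussian membership, and the per-axis evidence split are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def fuse_dempster(m1, m2):
    """Dempster's rule on the frame {REP, NOT} with masses (m_rep, m_not, m_theta)."""
    k = m1[0] * m2[1] + m1[1] * m2[0]                       # conflicting mass
    norm = 1.0 - k
    rep   = (m1[0] * m2[0] + m1[0] * m2[2] + m1[2] * m2[0]) / norm
    not_  = (m1[1] * m2[1] + m1[1] * m2[2] + m1[2] * m2[1]) / norm
    theta = (m1[2] * m2[2]) / norm
    return np.array([rep, not_, theta])

def select_voxel_point(points, sigma=0.5, uncertainty=0.2):
    """Pick the most representative point of one voxel.

    Each coordinate axis contributes one BPA built from a Gaussian fuzzy
    membership of the point's distance to the voxel centroid; the BPAs are
    fused with Dempster's rule and the point with the largest fused belief
    in 'representative' is returned.
    """
    centroid = points.mean(axis=0)
    best_idx, best_belief = 0, -1.0
    for i, p in enumerate(points):
        fused = np.array([0.0, 0.0, 1.0])                   # vacuous BPA
        for axis in range(3):
            mu = np.exp(-((p[axis] - centroid[axis]) ** 2) / (2 * sigma ** 2))
            bpa = np.array([mu, 1.0 - mu, 0.0]) * (1.0 - uncertainty)
            bpa[2] += uncertainty                            # leave mass on ignorance
            fused = fuse_dempster(fused, bpa)
        if fused[0] > best_belief:
            best_idx, best_belief = i, fused[0]
    return points[best_idx]

voxel = np.random.default_rng(0).normal(size=(16, 3))
print(select_voxel_point(voxel))
```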
{"title":"ET-PointPillars: improved PointPillars for 3D object detection based on optimized voxel downsampling","authors":"Yiyi Liu, Zhengyi Yang, JianLin Tong, Jiajia Yang, Jiongcheng Peng, Lihang Zhang, Wangxin Cheng","doi":"10.1007/s00138-024-01538-y","DOIUrl":"https://doi.org/10.1007/s00138-024-01538-y","url":null,"abstract":"<p>The preprocessing of point cloud data has always been an important problem in 3D object detection. Due to the large volume of point cloud data, voxelization methods are often used to represent the point cloud while reducing data density. However, common voxelization randomly selects sampling points from voxels, which often fails to represent local spatial features well due to noise. To preserve local features, this paper proposes an optimized voxel downsampling(OVD) method based on evidence theory. This method uses fuzzy sets to model basic probability assignments (BPAs) for each candidate point, incorporating point location information. It then employs evidence theory to fuse the BPAs and determine the selected sampling points. In the PointPillars 3D object detection algorithm, the point cloud is partitioned into pillars and encoded using each pillar’s points. Convolutional neural networks are used for feature extraction and detection. Another contribution is the proposed improved PointPillars based on evidence theory (ET-PointPillars) by introducing an OVD-based feature point sampling module in the PointPillars’ pillar feature network, which can select feature points in pillars using the optimized method, computes offsets to these points, and adds them as features to facilitate learning more object characteristics, improving traditional PointPillars. Experiments on the KITTI datasets validate the method’s ability to preserve local spatial features. Results showed improved detection precision, with a <span>(2.73%)</span> average increase for pedestrians and cyclists on KITTI.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"101 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140634425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-20, DOI: 10.1007/s00138-024-01539-x
Wei Tian, Fan Luo, Kailing Shen
Unsupervised video prediction is widely applied in intelligent decision-making scenarios due to its capability to model unknown scenes. Traditional video prediction models based on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) units consume large amounts of computational resources while progressively losing the original image information. This paper addresses these challenges and introduces PSRUNet, a novel model featuring the lightweight ParallelSRU unit. By prioritizing global spatiotemporal features and minimizing redundancy, PSRUNet effectively enhances the model's early perception of complex spatiotemporal changes. The addition of an encoder-decoder architecture captures high-dimensional image information, and information recall is introduced to mitigate gradient vanishing during deep network training. We evaluated the performance of PSRUNet and analyzed the capabilities of ParallelSRU in real-world applications, including short-term precipitation forecasting, traffic flow prediction, and human behavior prediction. Experimental results across multiple video prediction benchmarks demonstrate that PSRUNet achieves remarkably efficient and cost-effective predictions, making it a promising solution for meeting the real-time and accuracy requirements of practical business scenarios.
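The paper's ParallelSRU unit is not specified here, so the following is only an assumed baseline: a minimal convolutional simple-recurrent-unit (SRU-style) cell in PyTorch that keeps a lightweight per-pixel state and a highway connection to the input, the kind of unit such a recurrent prediction network could be built from.

```python
import torch
import torch.nn as nn

class ConvSRUCell(nn.Module):
    """Minimal convolutional simple-recurrent-unit cell (assumed baseline,
    not the exact ParallelSRU formulation from the paper)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # one convolution produces the candidate state, forget gate and reset gate
        self.gates = nn.Conv2d(channels, 3 * channels, kernel_size, padding=pad)

    def forward(self, x, c_prev):
        z, f, r = self.gates(x).chunk(3, dim=1)
        f = torch.sigmoid(f)
        r = torch.sigmoid(r)
        c = f * c_prev + (1.0 - f) * torch.tanh(z)   # lightweight state update
        h = r * c + (1.0 - r) * x                    # highway connection to the input
        return h, c

# toy usage on a short frame sequence shaped (B, T, C, H, W)
cell = ConvSRUCell(channels=8)
frames = torch.randn(2, 5, 8, 16, 16)
state = torch.zeros(2, 8, 16, 16)
for t in range(frames.size(1)):
    out, state = cell(frames[:, t], state)
print(out.shape)   # torch.Size([2, 8, 16, 16])
```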
{"title":"PSRUNet: a recurrent neural network for spatiotemporal sequence forecasting based on parallel simple recurrent unit","authors":"Wei Tian, Fan Luo, Kailing Shen","doi":"10.1007/s00138-024-01539-x","DOIUrl":"https://doi.org/10.1007/s00138-024-01539-x","url":null,"abstract":"<p>Unsupervised video prediction is widely applied in intelligent decision-making scenarios due to its capability to model unknown scenes. Traditional video prediction models based on Long Short-Term Memory (LSTM) and Gate Recurrent Unit (GRU) consume large amounts of computational resources while constantly losing the original picture information. This paper addresses the challenges discussed and introduces PSRUNet, a novel model featuring the lightweight ParallelSRU unit. By prioritizing global spatiotemporal features and minimizing redundancy, PSRUNet effectively enhances the model’s early perception of complex spatiotemporal changes. The addition of an encoder-decoder architecture captures high-dimensional image information, and information recall is introduced to mitigate gradient vanishing during deep network training. We evaluated the performance of PSRUNet and analyzed the capabilities of ParallelSRU in real-world applications, including short-term precipitation forecasting, traffic flow prediction, and human behavior prediction. Experimental results across multiple video prediction benchmarks demonstrate that PSRUNet achieves remarkably efficient and cost-effective predictions, making it a promising solution for meeting the real-time and accuracy requirements of practical business scenarios.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"81 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140625783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-15, DOI: 10.1007/s00138-024-01520-8
Jiazheng Wen, Huanyu Liu, Junbao Li
Multi-object tracking in dense scenes has always been a major difficulty in this field. Although some existing algorithms achieve excellent results in multi-object tracking, they fail to generalize well when the application background is transferred to more challenging dense scenarios. In this work, we propose PTDS (Pedestrian Tracking in Dense Scenes) CenterTrack, built on CenterTrack for object center point detection and tracking. It utilizes dense inter-frame similarity to compare object appearance features and predict inter-frame position changes of objects, extending CenterTrack, which relies on motion features alone. We propose a feature enhancement method based on a hybrid attention mechanism, which adds temporal information between frames to the features required for object detection and connects the two tasks of detection and tracking. On the MOT20 benchmark, PTDS CenterTrack achieves 55.6% MOTA, 55.1% IDF1, and 45.1% HOTA, increases of 10.1, 4.0, and 4.8 percentage points, respectively, over CenterTrack.
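A minimal sketch of what a hybrid (channel plus spatial) attention fusion of current- and previous-frame features might look like in PyTorch; the module layout, kernel sizes, and residual connection are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HybridAttentionFusion(nn.Module):
    """Toy channel + spatial attention over concatenated current/previous
    frame features (an assumed stand-in for the paper's hybrid attention)."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, 1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, feat_cur, feat_prev):
        x = self.reduce(torch.cat([feat_cur, feat_prev], dim=1))
        x = x * self.channel_gate(x)       # emphasise informative channels
        x = x * self.spatial_gate(x)       # emphasise informative locations
        return feat_cur + x                # residual enhancement of the current frame

fusion = HybridAttentionFusion(channels=64)
cur, prev = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(fusion(cur, prev).shape)             # torch.Size([1, 64, 32, 32])
```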
{"title":"PTDS CenterTrack: pedestrian tracking in dense scenes with re-identification and feature enhancement","authors":"Jiazheng Wen, Huanyu Liu, Junbao Li","doi":"10.1007/s00138-024-01520-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01520-8","url":null,"abstract":"<p>Multi-object tracking in dense scenes has always been a major difficulty in this field. Although some existing algorithms achieve excellent results in multi-object tracking, they fail to achieve good generalization when the application background is transferred to more challenging dense scenarios. In this work, we propose PTDS(Pedestrian Tracking in Dense Scene) CenterTrack based on the CenterTrack for object center point detection and tracking. It utilizes dense inter-frame similarity to perform object appearance feature comparisons to predict the inter-frame position changes of objects, extending CenterTrack by using only motion features. We propose a feature enhancement method based on a hybrid attention mechanism, which adds information on the temporal dimension between frames to the features required for object detection, and connects the two tasks of detection and tracking. Under the MOT20 benchmark, PTDS CenterTrack has achieved 55.6%MOTA, 55.1%IDF1, 45.1%HOTA, which is an increase of 10.1 percentage points, 4.0 percentage points, and 4.8 percentage points respectively compared to CenterTrack.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"2016 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-12, DOI: 10.1007/s00138-024-01531-5
Vukasin D. Stanojevic, Branimir T. Todorovic
Handling unreliable detections and avoiding identity switches are crucial for the success of multiple object tracking (MOT). Ideally, an MOT algorithm should use only true positive detections, work in real time, and produce no identity switches. To approach this ideal, we present BoostTrack, a simple yet effective tracking-by-detection MOT method that utilizes several lightweight plug-and-play additions to improve MOT performance. We design a detection-tracklet confidence score and use it to scale the similarity measure, implicitly favouring pairs with high detection confidence and high tracklet confidence in one-stage association. To reduce the ambiguity arising from using intersection over union (IoU), we propose novel Mahalanobis distance and shape similarity terms to boost the overall similarity measure. To utilize low-detection-score bounding boxes in one-stage association, we propose to boost the confidence scores of two groups of detections: those we assume to correspond to existing tracked objects, and those we assume to correspond to previously undetected objects. The proposed additions are orthogonal to existing approaches, and we combine them with interpolation and camera motion compensation to achieve results comparable to the standard benchmark solutions while retaining real-time execution speed. When combined with appearance similarity, our method outperforms all standard benchmark solutions on the MOT17 and MOT20 datasets. It ranks first among online methods in the HOTA metric in the MOT Challenge on the MOT17 and MOT20 test sets. We make our code available at https://github.com/vukasin-stanojevic/BoostTrack.
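A toy version of a boosted association score combining a confidence-scaled IoU with Mahalanobis-distance and shape-similarity terms is sketched below; the weights lam_m and lam_s, the exponential mappings, and the product form of the detection-tracklet confidence are assumptions rather than BoostTrack's exact definitions.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def shape_similarity(box_a, box_b):
    """Penalise relative width/height differences (assumed exponential form)."""
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return np.exp(-(abs(wa - wb) / max(wa, wb) + abs(ha - hb) / max(ha, hb)))

def boosted_similarity(det_box, det_conf, trk_box, trk_conf,
                       maha_dist, lam_m=0.25, lam_s=0.25):
    """Confidence-scaled IoU plus Mahalanobis and shape similarity terms."""
    conf = det_conf * trk_conf                 # detection-tracklet confidence
    maha_sim = np.exp(-maha_dist)              # map a distance to (0, 1]
    return (conf * iou(det_box, trk_box)
            + lam_m * maha_sim
            + lam_s * shape_similarity(det_box, trk_box))

print(boosted_similarity([0, 0, 10, 20], 0.9, [1, 1, 11, 21], 0.8, maha_dist=0.5))
```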
{"title":"BoostTrack: boosting the similarity measure and detection confidence for improved multiple object tracking","authors":"Vukasin D. Stanojevic, Branimir T. Todorovic","doi":"10.1007/s00138-024-01531-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01531-5","url":null,"abstract":"<p>Handling unreliable detections and avoiding identity switches are crucial for the success of multiple object tracking (MOT). Ideally, MOT algorithm should use true positive detections only, work in real-time and produce no identity switches. To approach the described ideal solution, we present the BoostTrack, a simple yet effective tracing-by-detection MOT method that utilizes several lightweight plug and play additions to improve MOT performance. We design a detection-tracklet confidence score and use it to scale the similarity measure and implicitly favour high detection confidence and high tracklet confidence pairs in one-stage association. To reduce the ambiguity arising from using intersection over union (IoU), we propose a novel Mahalanobis distance and shape similarity additions to boost the overall similarity measure. To utilize low-detection score bounding boxes in one-stage association, we propose to boost the confidence scores of two groups of detections: the detections we assume to correspond to the existing tracked object, and the detections we assume to correspond to a previously undetected object. The proposed additions are orthogonal to the existing approaches, and we combine them with interpolation and camera motion compensation to achieve results comparable to the standard benchmark solutions while retaining real-time execution speed. When combined with appearance similarity, our method outperforms all standard benchmark solutions on MOT17 and MOT20 datasets. It ranks first among online methods in HOTA metric in the MOT Challenge on MOT17 and MOT20 test sets. We make our code available at https://github.com/vukasin-stanojevic/BoostTrack.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"298 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-10, DOI: 10.1007/s00138-024-01535-1
Chhavi Dhiman, Anunay Varshney, Ved Vyapak
Drones are widespread and actively employed in a variety of applications due to their low cost and quick mobility, enabling new forms of action surveillance. However, human action recognition in aerial videos is especially challenging owing to the limited number of aerial-view samples and because aerial footage suffers from camera motion, illumination changes, small actor size, occlusion, complex backgrounds, and varying view angles. To address this, we propose the Aerial Polarized-Transformer Network (AP-TransNet) to recognize human actions in aerial view using both spatial and temporal details of the video feed. In this paper, we present the Polarized Encoding Block, which performs (i) selection with rejection, selecting the significant features and rejecting the least informative ones, similar to the light polarization phenomenon in photometry, and (ii) a boosting operation that increases the dynamic range of encodings using non-linear softmax normalization at the bottleneck tensors in both the channel and spatial sequential branches. The performance of the proposed AP-TransNet is evaluated through extensive experiments on three publicly available benchmark datasets: the drone action dataset, the UCF-ARG dataset, and the Multi-View Outdoor Dataset (MOD20), supported by an ablation study. The proposed work outperforms the state of the art.
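As an assumed illustration of a softmax-normalised bottleneck of the kind described, the sketch below implements a simple polarized-style channel branch in PyTorch; the paper's exact Polarized Encoding Block, its spatial branch, and its selection-with-rejection step are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolarizedChannelBranch(nn.Module):
    """Toy channel branch: a softmax-normalised bottleneck re-weights channels,
    loosely following polarized self-attention (assumed form, not AP-TransNet's)."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, 1, 1)               # spatial query -> one map
        self.v = nn.Conv2d(channels, channels // 2, 1)   # bottleneck values
        self.up = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # softmax over spatial positions boosts the dynamic range of the query
        q = F.softmax(self.q(x).view(b, 1, h * w), dim=-1)           # (B, 1, HW)
        v = self.v(x).view(b, c // 2, h * w)                         # (B, C/2, HW)
        ctx = torch.bmm(v, q.transpose(1, 2)).view(b, c // 2, 1, 1)  # pooled context
        gate = torch.sigmoid(self.up(ctx))                           # channel gate
        return x * gate

block = PolarizedChannelBranch(channels=32)
print(block(torch.randn(2, 32, 14, 14)).shape)    # torch.Size([2, 32, 14, 14])
```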
{"title":"AP-TransNet: a polarized transformer based aerial human action recognition framework","authors":"Chhavi Dhiman, Anunay Varshney, Ved Vyapak","doi":"10.1007/s00138-024-01535-1","DOIUrl":"https://doi.org/10.1007/s00138-024-01535-1","url":null,"abstract":"<p>Drones are widespread and actively employed in a variety of applications due to their low cost and quick mobility and enabling new forms of action surveillance. However, owing to various challenges- limited no. of aerial view samples, aerial footage suffers with camera motion, illumination changes, small actor size, occlusion, complex backgrounds, and varying view angles, human action recognition in aerial videos even more challenging. Maneuvering the same, we propose Aerial Polarized-Transformer Network (AP-TransNet) to recognize human actions in aerial view using both spatial and temporal details of the video feed. In this paper, we present the Polarized Encoding Block that performs (<span>({text{i}}))</span> Selection with Rejection to select the significant features and reject least informative features similar to Light photometry phenomena and (<span>({text{ii}}))</span> boosting operation increases the dynamic range of encodings using non-linear softmax normalization at the bottleneck tensors in both channel and spatial sequential branches. The performance of the proposed AP-TransNet is evaluated by conducting extensive experiments on three publicly available benchmark datasets: drone action dataset, UCF-ARG Dataset and Multi-View Outdoor Dataset (MOD20) supporting with ablation study. The proposed work outperformed the state-of-the-arts.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"52 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-09, DOI: 10.1007/s00138-024-01533-3
Georgios Petrakis, Panagiotis Partsinevelos
Semantic segmentation plays a significant role in unstructured and planetary scene understanding, offering a robotic system or planetary rover valuable knowledge about its surroundings. Several studies investigate rover-based scene recognition in planetary-like environments, but there is a lack of a semantic segmentation architecture focused on computing systems with low resources and tested on the lunar surface. In this study, a lightweight encoder-decoder neural network (NN) architecture is proposed for rover-based ground segmentation on the lunar surface. The proposed architecture is composed of a modified MobileNetV2 as encoder and a lightweight U-net decoder, while training and evaluation were conducted using a publicly available synthetic dataset with lunar landscape images. The proposed model provides robust segmentation results, allowing lunar scene understanding focused on rocks and boulders. It achieves accuracy similar to that of the original U-net and U-net-based architectures, which are 110–140 times larger than the proposed architecture. This study aims to contribute to lunar landscape segmentation using deep learning techniques and demonstrates great potential for autonomous lunar navigation, ensuring safer and smoother navigation on the Moon. To the best of our knowledge, this is the first study to propose a lightweight semantic segmentation architecture for the lunar surface, aiming to reinforce autonomous rover navigation.
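A rough PyTorch sketch of the encoder-decoder idea, pairing a MobileNetV2 feature extractor with a small upsampling head; the U-net skip connections of the paper's decoder are omitted here for brevity, and the channel sizes and class count are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class LightLunarSeg(nn.Module):
    """Sketch: MobileNetV2 features as encoder, a small upsampling decoder head.
    Skip connections and exact channel sizes of the paper's model are assumed away."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder = mobilenet_v2(weights=None).features   # 1280-ch output at 1/32 scale
        self.decoder = nn.Sequential(
            nn.Conv2d(1280, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = LightLunarSeg(num_classes=2)
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 2, 224, 224])
```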
{"title":"Lunar ground segmentation using a modified U-net neural network","authors":"Georgios Petrakis, Panagiotis Partsinevelos","doi":"10.1007/s00138-024-01533-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01533-3","url":null,"abstract":"<p>Semantic segmentation plays a significant role in unstructured and planetary scene understanding, offering to a robotic system or a planetary rover valuable knowledge about its surroundings. Several studies investigate rover-based scene recognition planetary-like environments but there is a lack of a semantic segmentation architecture, focused on computing systems with low resources and tested on the lunar surface. In this study, a lightweight encoder-decoder neural network (NN) architecture is proposed for rover-based ground segmentation on the lunar surface. The proposed architecture is composed by a modified MobilenetV2 as encoder and a lightweight U-net decoder while the training and evaluation process were conducted using a publicly available synthetic dataset with lunar landscape images. The proposed model provides robust segmentation results, allowing the lunar scene understanding focused on rocks and boulders. It achieves similar accuracy, compared with original U-net and U-net-based architectures which are 110–140 times larger than the proposed architecture. This study, aims to contribute in lunar landscape segmentation utilizing deep learning techniques, while it proves a great potential in autonomous lunar navigation ensuring a safer and smoother navigation on the moon. To the best of our knowledge, this is the first study which propose a lightweight semantic segmentation architecture for the lunar surface, aiming to reinforce the autonomous rover navigation.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"65 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-09, DOI: 10.1007/s00138-024-01529-z
Chenyu Ma, Jinfang Jia, Jianqiang Huang, Li Wu, Xiaoying Wang
Few-shot learning (FSL) aims to adapt quickly to new categories with limited samples. Despite significant progress in utilizing meta-learning for solving FSL tasks, challenges such as overfitting and poor generalization still exist. Building upon the demonstrated significance of powerful feature representation, this work proposes DisRot, a novel two-strategy training mechanism that combines knowledge distillation and a rotation prediction task for the pre-training phase of transfer learning. Knowledge distillation enables shallow networks to learn relational knowledge contained in deep networks, while the self-supervised rotation prediction task provides class-irrelevant and transferable knowledge for the supervised task. Simultaneous optimization of these two tasks allows the model to learn a generalizable and transferable feature embedding. Extensive experiments on the miniImageNet and FC100 datasets demonstrate that DisRot can effectively improve the generalization ability of the model and is comparable to the leading FSL methods.
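A minimal sketch of such a joint pre-training objective: supervised cross-entropy plus a temperature-scaled distillation term and a 4-way rotation-prediction loss; the weights alpha and beta and the temperature tau are assumed placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def disrot_style_loss(student_logits, teacher_logits, class_labels,
                      rot_logits, rot_labels, alpha=0.5, beta=1.0, tau=4.0):
    """Joint pre-training loss: supervised CE + KD from a teacher +
    self-supervised rotation prediction (weights and temperature assumed)."""
    ce = F.cross_entropy(student_logits, class_labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau                                    # standard temperature scaling
    rot = F.cross_entropy(rot_logits, rot_labels)    # 4-way: 0/90/180/270 degrees
    return ce + alpha * kd + beta * rot

# toy usage with random logits for a 64-class base set
s = torch.randn(8, 64)
t = torch.randn(8, 64)
y = torch.randint(0, 64, (8,))
r_logits = torch.randn(8, 4)
r_labels = torch.randint(0, 4, (8,))
print(disrot_style_loss(s, t, y, r_logits, r_labels))
```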
{"title":"DisRot: boosting the generalization capability of few-shot learning via knowledge distillation and self-supervised learning","authors":"Chenyu Ma, Jinfang Jia, Jianqiang Huang, Li Wu, Xiaoying Wang","doi":"10.1007/s00138-024-01529-z","DOIUrl":"https://doi.org/10.1007/s00138-024-01529-z","url":null,"abstract":"<p>Few-shot learning (FSL) aims to adapt quickly to new categories with limited samples. Despite significant progress in utilizing meta-learning for solving FSL tasks, challenges such as overfitting and poor generalization still exist. Building upon the demonstrated significance of powerful feature representation, this work proposes disRot, a novel two-strategy training mechanism, which combines knowledge distillation and rotation prediction task for the pre-training phase of transfer learning. Knowledge distillation enables shallow networks to learn relational knowledge contained in deep networks, while the self-supervised rotation prediction task provides class-irrelevant and transferable knowledge for the supervised task. Simultaneous optimization for these two tasks allows the model learn generalizable and transferable feature embedding. Extensive experiments on the miniImageNet and FC100 datasets demonstrate that disRot can effectively improve the generalization ability of the model and is comparable to the leading FSL methods.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"37 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The quality of annotations in datasets is crucial for supervised machine learning, as it significantly affects the performance of models. While many public datasets are widely used, they often suffer from annotation errors, including missing annotations and incorrect bounding box sizes and positions. This results in lower accuracy of machine learning models. However, most researchers have traditionally focused on improving model performance by enhancing algorithms, while overlooking concerns regarding data quality. This so-called model-centric AI approach has been predominant. In contrast, a data-centric AI approach, advocated by Andrew Ng at the DATA and AI Summit 2022, emphasizes enhancing data quality while keeping the model fixed, which proves to be more efficient in improving performance. Building upon this data-centric approach, we propose a method to enhance the quality of public datasets such as MS-COCO and the Open Images Dataset. Our approach involves automatically retrieving missing annotations and correcting the size and position of existing bounding boxes in these datasets. Specifically, our study deals with human object detection, which is one of the prominent applications of artificial intelligence. Experimental results demonstrate improved performance with models such as Faster-RCNN, EfficientDet, and RetinaNet. We achieve an improvement of up to 32% in mAP over the original datasets after applying both proposed methods to a dataset in which grouped instances are transformed into individual instances. In summary, our methods significantly enhance model performance by improving the quality of annotations at a lower cost and in less time than the manual improvement employed in other studies.
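A toy sketch of the general idea of refining annotations with a detector's outputs: ground-truth boxes that overlap a detection strongly are snapped to it, and detections matching no ground-truth box are added as recovered missing annotations; the IoU thresholds and the matching rule are assumptions, not the paper's procedure.

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two arrays of [x1, y1, x2, y2] boxes."""
    ious = np.zeros((len(boxes_a), len(boxes_b)))
    for i, a in enumerate(boxes_a):
        for j, b in enumerate(boxes_b):
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (b[2] - b[0]) * (b[3] - b[1]) - inter)
            ious[i, j] = inter / (union + 1e-9)
    return ious

def refine_annotations(gt_boxes, detector_boxes, add_thr=0.3, replace_thr=0.5):
    """Add detector boxes that match no ground-truth box (missing annotations)
    and snap loosely drawn ground-truth boxes to the best-overlapping detection."""
    gt_boxes = np.asarray(gt_boxes, float)
    detector_boxes = np.asarray(detector_boxes, float)
    ious = (iou_matrix(detector_boxes, gt_boxes)
            if len(gt_boxes) else np.zeros((len(detector_boxes), 0)))
    refined = []
    for j, g in enumerate(gt_boxes):
        best = ious[:, j].argmax() if len(detector_boxes) else -1
        keep_det = best >= 0 and ious[best, j] >= replace_thr
        refined.append(detector_boxes[best] if keep_det else g)   # corrected box
    for i, d in enumerate(detector_boxes):
        if len(gt_boxes) == 0 or ious[i].max() < add_thr:
            refined.append(d)                                     # recovered annotation
    return np.array(refined)

print(refine_annotations([[0, 0, 10, 10]], [[1, 1, 10, 10], [50, 50, 70, 80]]))
```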
{"title":"The improvement of ground truth annotation in public datasets for human detection","authors":"Sotheany Nou, Joong-Sun Lee, Nagaaki Ohyama, Takashi Obi","doi":"10.1007/s00138-024-01527-1","DOIUrl":"https://doi.org/10.1007/s00138-024-01527-1","url":null,"abstract":"<p>The quality of annotations in the datasets is crucial for supervised machine learning as it significantly affects the performance of models. While many public datasets are widely used, they often suffer from annotations errors, including missing annotations, incorrect bounding box sizes, and positions. It results in low accuracy of machine learning models. However, most researchers have traditionally focused on improving model performance by enhancing algorithms, while overlooking concerns regarding data quality. This so-called model-centric AI approach has been predominant. In contrast, a data-centric AI approach, advocated by Andrew Ng at the DATA and AI Summit 2022, emphasizes enhancing data quality while keeping the model fixed, which proves to be more efficient in improving performance. Building upon this data-centric approach, we propose a method to enhance the quality of public datasets such as MS-COCO and Open Image Dataset. Our approach involves automatically retrieving missing annotations and correcting the size and position of existing bounding boxes in these datasets. Specifically, our study deals with human object detection, which is one of the prominent applications of artificial intelligence. Experimental results demonstrate improved performance with models such as Faster-RCNN, EfficientDet, and RetinaNet. We can achieve up to 32% compared to original datasets in the term of mAP after applying both proposed methods to dataset which is transformed the grouped of instances to individual instance. In summary, our methods significantly enhance the model’s performance by improving the quality of annotations at a lower cost with less time than manual improvement employed in other studies.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"23 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article, we introduce an advanced approach for enhanced image denoising using an improved space-variant anisotropic Partial Differential Equation (PDE) framework. Leveraging Weickert-type operators, this method relies on two critical parameters, λ and θ, defining local image geometry and smoothing strength. We propose an automatic parameter estimation technique rooted in PDE-constrained optimization, incorporating supplementary information from the original clean image. By combining these components, our approach achieves superior image denoising, pushing the boundaries of image enhancement methods. We employed a modified Alternating Direction Method of Multipliers (ADMM) procedure for numerical optimization, demonstrating its efficacy through thorough assessments and affirming its superior performance compared to alternative denoising methods.
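For orientation, a generic Weickert-type space-variant diffusion model is written out below; the exact tensor used in the paper and the precise roles of λ and θ may differ, so this is only an assumed reference form.

```latex
% Generic structure-tensor-driven anisotropic diffusion (assumed reference form)
\partial_t u = \operatorname{div}\!\bigl(D\bigl(J_\rho(\nabla u_\sigma)\bigr)\,\nabla u\bigr),
\qquad
D = \theta_1\, v_1 v_1^{\top} + \theta_2\, v_2 v_2^{\top},
\qquad
\theta_1 = g_\lambda(\mu_1),\ \ \theta_2 \approx 1,
```

where v_1, v_2 and μ_1 ≥ μ_2 are the eigenvectors and eigenvalues of the smoothed structure tensor J_ρ, and g_λ is a decreasing diffusivity with contrast parameter λ, so smoothing is reduced across strong image structures and kept along them.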
{"title":"Tensor-guided learning for image denoising using anisotropic PDEs","authors":"Fakhr-eddine Limami, Aissam Hadri, Lekbir Afraites, Amine Laghrib","doi":"10.1007/s00138-024-01532-4","DOIUrl":"https://doi.org/10.1007/s00138-024-01532-4","url":null,"abstract":"<p>In this article, we introduce an advanced approach for enhanced image denoising using an improved space-variant anisotropic Partial Differential Equation (PDE) framework. Leveraging Weickert-type operators, this method relies on two critical parameters: <span>(lambda )</span> and <span>(theta )</span>, defining local image geometry and smoothing strength. We propose an automatic parameter estimation technique rooted in PDE-constrained optimization, incorporating supplementary information from the original clean image. By combining these components, our approach achieves superior image denoising, pushing the boundaries of image enhancement methods. We employed a modified Alternating Direction Method of Multipliers (ADMM) procedure for numerical optimization, demonstrating its efficacy through thorough assessments and affirming its superior performance compared to alternative denoising methods.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"44 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-07, DOI: 10.1007/s00138-024-01526-2
Wugen Zhou, Xiaodong Peng, Yun Li, Mingrui Fan, Bo Liu
The robustness of dense visual SLAM remains a challenging problem in dynamic environments. In this paper, we propose a novel keyframe-based dense visual SLAM method that handles highly dynamic environments using an RGB-D camera. The proposed method uses cluster-based residual models and semantic cues to detect dynamic objects, resulting in motion segmentation that outperforms traditional methods. The method also employs motion-segmentation-based keyframe selection strategies and a frame-to-keyframe matching scheme that reduce the influence of dynamic objects, thus minimizing trajectory errors. We further filter out the influence of dynamic objects based on motion segmentation and then employ true matches from keyframes near the current keyframe to facilitate loop closure. Finally, a pose graph is established and optimized using the g2o framework. Our experimental results demonstrate the success of our approach in handling highly dynamic sequences, as evidenced by more robust motion segmentation results and significantly lower trajectory drift compared to several state-of-the-art dense visual odometry and SLAM methods on challenging public benchmark datasets.
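A toy NumPy sketch of a cluster-level residual test for flagging dynamic objects: points are warped by the estimated camera motion and clusters whose mean residual stays high are marked dynamic; the threshold and the purely geometric residual are assumptions, since the paper also relies on photometric residuals and semantic cues.

```python
import numpy as np

def flag_dynamic_clusters(points_prev, points_cur, labels, R, t, thr=0.05):
    """Mark point clusters whose mean geometric residual under the estimated
    camera motion (R, t) exceeds a threshold as dynamic (toy criterion)."""
    residuals = np.linalg.norm(points_cur - (points_prev @ R.T + t), axis=1)
    dynamic = set()
    for c in np.unique(labels):
        if residuals[labels == c].mean() > thr:
            dynamic.add(int(c))
    return dynamic

rng = np.random.default_rng(1)
pts = rng.normal(size=(100, 3))
labels = np.repeat([0, 1], 50)
moved = pts.copy()
moved[50:] += 0.3                      # cluster 1 moves independently -> dynamic
print(flag_dynamic_clusters(pts, moved, labels, np.eye(3), np.zeros(3)))  # {1}
```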
{"title":"Keyframe-based RGB-D dense visual SLAM fused semantic cues in dynamic scenes","authors":"Wugen Zhou, Xiaodong Peng, Yun Li, Mingrui Fan, Bo Liu","doi":"10.1007/s00138-024-01526-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01526-2","url":null,"abstract":"<p>The robustness of dense visual SLAM is still a challenging problem in dynamic environments. In this paper, we propose a novel keyframe-based dense visual SLAM to handle a highly dynamic environment by using an RGB-D camera. The proposed method uses cluster-based residual models and semantic cues to detect dynamic objects, resulting in motion segmentation that outperforms traditional methods. The method also employs motion-segmentation based keyframe selection strategies and frame-to-keyframe matching scheme that reduce the influence of dynamic objects, thus minimizing trajectory errors. We further filter out dynamic object influence based on motion segmentation and then employ true matches from keyframes, which are near the current keyframe, to facilitate loop closure. Finally, a pose graph is established and optimized using the g2o framework. Our experimental results demonstrate the success of our approach in handling highly dynamic sequences, as evidenced by the more robust motion segmentation results and significantly lower trajectory drift compared to several state-of-the-art dense visual odometry or SLAM methods on challenging public benchmark datasets.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"53 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}