The preprocessing of point cloud data has always been an important problem in 3D object detection. Due to the large volume of point cloud data, voxelization methods are often used to represent the point cloud while reducing data density. However, common voxelization randomly selects sampling points from voxels, which often fails to represent local spatial features well due to noise. To preserve local features, this paper proposes an optimized voxel downsampling (OVD) method based on evidence theory. This method uses fuzzy sets to model basic probability assignments (BPAs) for each candidate point, incorporating point location information. It then employs evidence theory to fuse the BPAs and determine the selected sampling points. In the PointPillars 3D object detection algorithm, the point cloud is partitioned into pillars and encoded using each pillar's points, and convolutional neural networks are used for feature extraction and detection. Another contribution is an improved PointPillars based on evidence theory (ET-PointPillars), which introduces an OVD-based feature point sampling module into the PointPillars pillar feature network; the module selects feature points in pillars using the optimized method, computes offsets to these points, and adds them as features to facilitate learning more object characteristics, improving on traditional PointPillars. Experiments on the KITTI dataset validate the method's ability to preserve local spatial features. Results show improved detection precision, with a 2.73% average increase for pedestrians and cyclists.
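As a rough illustration of how evidence-theoretic point selection inside a voxel could look, the sketch below builds one BPA per coordinate axis from a Gaussian fuzzy membership of the point's distance to the voxel centroid and fuses the three BPAs with Dempster's rule; the two-element frame, the Gaussian membership, and the per-axis evidence split are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def fuse_dempster(m1, m2):
    """Dempster's rule on the frame {REP, NOT} with masses (m_rep, m_not, m_theta)."""
    k = m1[0] * m2[1] + m1[1] * m2[0]                       # conflicting mass
    norm = 1.0 - k
    rep   = (m1[0] * m2[0] + m1[0] * m2[2] + m1[2] * m2[0]) / norm
    not_  = (m1[1] * m2[1] + m1[1] * m2[2] + m1[2] * m2[1]) / norm
    theta = (m1[2] * m2[2]) / norm
    return np.array([rep, not_, theta])

def select_voxel_point(points, sigma=0.5, uncertainty=0.2):
    """Pick the most representative point of one voxel.

    Each coordinate axis contributes one BPA built from a Gaussian fuzzy
    membership of the point's distance to the voxel centroid; the BPAs are
    fused with Dempster's rule and the point with the largest fused belief
    in 'representative' is returned.
    """
    centroid = points.mean(axis=0)
    best_idx, best_belief = 0, -1.0
    for i, p in enumerate(points):
        fused = np.array([0.0, 0.0, 1.0])                   # vacuous BPA
        for axis in range(3):
            mu = np.exp(-((p[axis] - centroid[axis]) ** 2) / (2 * sigma ** 2))
            bpa = np.array([mu, 1.0 - mu, 0.0]) * (1.0 - uncertainty)
            bpa[2] += uncertainty                            # leave mass on ignorance
            fused = fuse_dempster(fused, bpa)
        if fused[0] > best_belief:
            best_idx, best_belief = i, fused[0]
    return points[best_idx]

voxel = np.random.default_rng(0).normal(size=(16, 3))
print(select_voxel_point(voxel))
```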
{"title":"ET-PointPillars: improved PointPillars for 3D object detection based on optimized voxel downsampling","authors":"Yiyi Liu, Zhengyi Yang, JianLin Tong, Jiajia Yang, Jiongcheng Peng, Lihang Zhang, Wangxin Cheng","doi":"10.1007/s00138-024-01538-y","DOIUrl":"https://doi.org/10.1007/s00138-024-01538-y","url":null,"abstract":"<p>The preprocessing of point cloud data has always been an important problem in 3D object detection. Due to the large volume of point cloud data, voxelization methods are often used to represent the point cloud while reducing data density. However, common voxelization randomly selects sampling points from voxels, which often fails to represent local spatial features well due to noise. To preserve local features, this paper proposes an optimized voxel downsampling(OVD) method based on evidence theory. This method uses fuzzy sets to model basic probability assignments (BPAs) for each candidate point, incorporating point location information. It then employs evidence theory to fuse the BPAs and determine the selected sampling points. In the PointPillars 3D object detection algorithm, the point cloud is partitioned into pillars and encoded using each pillar’s points. Convolutional neural networks are used for feature extraction and detection. Another contribution is the proposed improved PointPillars based on evidence theory (ET-PointPillars) by introducing an OVD-based feature point sampling module in the PointPillars’ pillar feature network, which can select feature points in pillars using the optimized method, computes offsets to these points, and adds them as features to facilitate learning more object characteristics, improving traditional PointPillars. Experiments on the KITTI datasets validate the method’s ability to preserve local spatial features. Results showed improved detection precision, with a <span>(2.73%)</span> average increase for pedestrians and cyclists on KITTI.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"101 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140634425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-20, DOI: 10.1007/s00138-024-01539-x
Wei Tian, Fan Luo, Kailing Shen
Unsupervised video prediction is widely applied in intelligent decision-making scenarios due to its capability to model unknown scenes. Traditional video prediction models based on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) units consume large amounts of computational resources while progressively losing the original image information. This paper addresses these challenges and introduces PSRUNet, a novel model featuring the lightweight ParallelSRU unit. By prioritizing global spatiotemporal features and minimizing redundancy, PSRUNet effectively enhances the model's early perception of complex spatiotemporal changes. The addition of an encoder-decoder architecture captures high-dimensional image information, and information recall is introduced to mitigate gradient vanishing during deep network training. We evaluated the performance of PSRUNet and analyzed the capabilities of ParallelSRU in real-world applications, including short-term precipitation forecasting, traffic flow prediction, and human behavior prediction. Experimental results across multiple video prediction benchmarks demonstrate that PSRUNet achieves remarkably efficient and cost-effective predictions, making it a promising solution for meeting the real-time and accuracy requirements of practical business scenarios.
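The paper's ParallelSRU unit is not specified here, so the following is only an assumed baseline: a minimal convolutional simple-recurrent-unit (SRU-style) cell in PyTorch that keeps a lightweight per-pixel state and a highway connection to the input, the kind of unit such a recurrent prediction network could be built from.

```python
import torch
import torch.nn as nn

class ConvSRUCell(nn.Module):
    """Minimal convolutional simple-recurrent-unit cell (assumed baseline,
    not the exact ParallelSRU formulation from the paper)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # one convolution produces the candidate state, forget gate and reset gate
        self.gates = nn.Conv2d(channels, 3 * channels, kernel_size, padding=pad)

    def forward(self, x, c_prev):
        z, f, r = self.gates(x).chunk(3, dim=1)
        f = torch.sigmoid(f)
        r = torch.sigmoid(r)
        c = f * c_prev + (1.0 - f) * torch.tanh(z)   # lightweight state update
        h = r * c + (1.0 - r) * x                    # highway connection to the input
        return h, c

# toy usage on a short frame sequence shaped (B, T, C, H, W)
cell = ConvSRUCell(channels=8)
frames = torch.randn(2, 5, 8, 16, 16)
state = torch.zeros(2, 8, 16, 16)
for t in range(frames.size(1)):
    out, state = cell(frames[:, t], state)
print(out.shape)   # torch.Size([2, 8, 16, 16])
```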
{"title":"PSRUNet: a recurrent neural network for spatiotemporal sequence forecasting based on parallel simple recurrent unit","authors":"Wei Tian, Fan Luo, Kailing Shen","doi":"10.1007/s00138-024-01539-x","DOIUrl":"https://doi.org/10.1007/s00138-024-01539-x","url":null,"abstract":"<p>Unsupervised video prediction is widely applied in intelligent decision-making scenarios due to its capability to model unknown scenes. Traditional video prediction models based on Long Short-Term Memory (LSTM) and Gate Recurrent Unit (GRU) consume large amounts of computational resources while constantly losing the original picture information. This paper addresses the challenges discussed and introduces PSRUNet, a novel model featuring the lightweight ParallelSRU unit. By prioritizing global spatiotemporal features and minimizing redundancy, PSRUNet effectively enhances the model’s early perception of complex spatiotemporal changes. The addition of an encoder-decoder architecture captures high-dimensional image information, and information recall is introduced to mitigate gradient vanishing during deep network training. We evaluated the performance of PSRUNet and analyzed the capabilities of ParallelSRU in real-world applications, including short-term precipitation forecasting, traffic flow prediction, and human behavior prediction. Experimental results across multiple video prediction benchmarks demonstrate that PSRUNet achieves remarkably efficient and cost-effective predictions, making it a promising solution for meeting the real-time and accuracy requirements of practical business scenarios.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"81 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140625783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-15, DOI: 10.1007/s00138-024-01520-8
Jiazheng Wen, Huanyu Liu, Junbao Li
Multi-object tracking in dense scenes has always been a major difficulty in this field. Although some existing algorithms achieve excellent results in multi-object tracking, they fail to generalize well when the application background is transferred to more challenging dense scenarios. In this work, we propose PTDS (Pedestrian Tracking in Dense Scenes) CenterTrack, built on CenterTrack for object center point detection and tracking. It utilizes dense inter-frame similarity to compare object appearance features and predict inter-frame position changes of objects, extending CenterTrack, which relies on motion features alone. We propose a feature enhancement method based on a hybrid attention mechanism, which adds temporal information between frames to the features required for object detection and connects the two tasks of detection and tracking. On the MOT20 benchmark, PTDS CenterTrack achieves 55.6% MOTA, 55.1% IDF1, and 45.1% HOTA, increases of 10.1, 4.0, and 4.8 percentage points, respectively, over CenterTrack.
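A minimal sketch of what a hybrid (channel plus spatial) attention fusion of current- and previous-frame features might look like in PyTorch; the module layout, kernel sizes, and residual connection are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HybridAttentionFusion(nn.Module):
    """Toy channel + spatial attention over concatenated current/previous
    frame features (an assumed stand-in for the paper's hybrid attention)."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, 1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, feat_cur, feat_prev):
        x = self.reduce(torch.cat([feat_cur, feat_prev], dim=1))
        x = x * self.channel_gate(x)       # emphasise informative channels
        x = x * self.spatial_gate(x)       # emphasise informative locations
        return feat_cur + x                # residual enhancement of the current frame

fusion = HybridAttentionFusion(channels=64)
cur, prev = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(fusion(cur, prev).shape)             # torch.Size([1, 64, 32, 32])
```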
{"title":"PTDS CenterTrack: pedestrian tracking in dense scenes with re-identification and feature enhancement","authors":"Jiazheng Wen, Huanyu Liu, Junbao Li","doi":"10.1007/s00138-024-01520-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01520-8","url":null,"abstract":"<p>Multi-object tracking in dense scenes has always been a major difficulty in this field. Although some existing algorithms achieve excellent results in multi-object tracking, they fail to achieve good generalization when the application background is transferred to more challenging dense scenarios. In this work, we propose PTDS(Pedestrian Tracking in Dense Scene) CenterTrack based on the CenterTrack for object center point detection and tracking. It utilizes dense inter-frame similarity to perform object appearance feature comparisons to predict the inter-frame position changes of objects, extending CenterTrack by using only motion features. We propose a feature enhancement method based on a hybrid attention mechanism, which adds information on the temporal dimension between frames to the features required for object detection, and connects the two tasks of detection and tracking. Under the MOT20 benchmark, PTDS CenterTrack has achieved 55.6%MOTA, 55.1%IDF1, 45.1%HOTA, which is an increase of 10.1 percentage points, 4.0 percentage points, and 4.8 percentage points respectively compared to CenterTrack.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"2016 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-12, DOI: 10.1007/s00138-024-01531-5
Vukasin D. Stanojevic, Branimir T. Todorovic
Handling unreliable detections and avoiding identity switches are crucial for the success of multiple object tracking (MOT). Ideally, an MOT algorithm should use only true positive detections, work in real time, and produce no identity switches. To approach this ideal, we present BoostTrack, a simple yet effective tracking-by-detection MOT method that utilizes several lightweight plug-and-play additions to improve MOT performance. We design a detection-tracklet confidence score and use it to scale the similarity measure, implicitly favouring pairs with high detection confidence and high tracklet confidence in one-stage association. To reduce the ambiguity arising from using intersection over union (IoU), we propose novel Mahalanobis distance and shape similarity terms to boost the overall similarity measure. To utilize low-detection-score bounding boxes in one-stage association, we propose to boost the confidence scores of two groups of detections: those we assume to correspond to existing tracked objects, and those we assume to correspond to previously undetected objects. The proposed additions are orthogonal to existing approaches, and we combine them with interpolation and camera motion compensation to achieve results comparable to the standard benchmark solutions while retaining real-time execution speed. When combined with appearance similarity, our method outperforms all standard benchmark solutions on the MOT17 and MOT20 datasets. It ranks first among online methods in the HOTA metric in the MOT Challenge on the MOT17 and MOT20 test sets. We make our code available at https://github.com/vukasin-stanojevic/BoostTrack.
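A toy version of a boosted association score combining a confidence-scaled IoU with Mahalanobis-distance and shape-similarity terms is sketched below; the weights lam_m and lam_s, the exponential mappings, and the product form of the detection-tracklet confidence are assumptions rather than BoostTrack's exact definitions.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def shape_similarity(box_a, box_b):
    """Penalise relative width/height differences (assumed exponential form)."""
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return np.exp(-(abs(wa - wb) / max(wa, wb) + abs(ha - hb) / max(ha, hb)))

def boosted_similarity(det_box, det_conf, trk_box, trk_conf,
                       maha_dist, lam_m=0.25, lam_s=0.25):
    """Confidence-scaled IoU plus Mahalanobis and shape similarity terms."""
    conf = det_conf * trk_conf                 # detection-tracklet confidence
    maha_sim = np.exp(-maha_dist)              # map a distance to (0, 1]
    return (conf * iou(det_box, trk_box)
            + lam_m * maha_sim
            + lam_s * shape_similarity(det_box, trk_box))

print(boosted_similarity([0, 0, 10, 20], 0.9, [1, 1, 11, 21], 0.8, maha_dist=0.5))
```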
{"title":"BoostTrack: boosting the similarity measure and detection confidence for improved multiple object tracking","authors":"Vukasin D. Stanojevic, Branimir T. Todorovic","doi":"10.1007/s00138-024-01531-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01531-5","url":null,"abstract":"<p>Handling unreliable detections and avoiding identity switches are crucial for the success of multiple object tracking (MOT). Ideally, MOT algorithm should use true positive detections only, work in real-time and produce no identity switches. To approach the described ideal solution, we present the BoostTrack, a simple yet effective tracing-by-detection MOT method that utilizes several lightweight plug and play additions to improve MOT performance. We design a detection-tracklet confidence score and use it to scale the similarity measure and implicitly favour high detection confidence and high tracklet confidence pairs in one-stage association. To reduce the ambiguity arising from using intersection over union (IoU), we propose a novel Mahalanobis distance and shape similarity additions to boost the overall similarity measure. To utilize low-detection score bounding boxes in one-stage association, we propose to boost the confidence scores of two groups of detections: the detections we assume to correspond to the existing tracked object, and the detections we assume to correspond to a previously undetected object. The proposed additions are orthogonal to the existing approaches, and we combine them with interpolation and camera motion compensation to achieve results comparable to the standard benchmark solutions while retaining real-time execution speed. When combined with appearance similarity, our method outperforms all standard benchmark solutions on MOT17 and MOT20 datasets. It ranks first among online methods in HOTA metric in the MOT Challenge on MOT17 and MOT20 test sets. We make our code available at https://github.com/vukasin-stanojevic/BoostTrack.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"298 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-10, DOI: 10.1007/s00138-024-01535-1
Chhavi Dhiman, Anunay Varshney, Ved Vyapak
Drones are widespread and actively employed in a variety of applications due to their low cost and quick mobility, enabling new forms of action surveillance. However, human action recognition in aerial videos is especially challenging owing to the limited number of aerial-view samples and because aerial footage suffers from camera motion, illumination changes, small actor size, occlusion, complex backgrounds, and varying view angles. To address this, we propose the Aerial Polarized-Transformer Network (AP-TransNet) to recognize human actions in aerial view using both spatial and temporal details of the video feed. In this paper, we present the Polarized Encoding Block, which performs (i) selection with rejection, selecting the significant features and rejecting the least informative ones, similar to the light polarization phenomenon in photometry, and (ii) a boosting operation that increases the dynamic range of encodings using non-linear softmax normalization at the bottleneck tensors in both the channel and spatial sequential branches. The performance of the proposed AP-TransNet is evaluated through extensive experiments on three publicly available benchmark datasets: the drone action dataset, the UCF-ARG dataset, and the Multi-View Outdoor Dataset (MOD20), supported by an ablation study. The proposed work outperforms the state of the art.
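As an assumed illustration of a softmax-normalised bottleneck of the kind described, the sketch below implements a simple polarized-style channel branch in PyTorch; the paper's exact Polarized Encoding Block, its spatial branch, and its selection-with-rejection step are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolarizedChannelBranch(nn.Module):
    """Toy channel branch: a softmax-normalised bottleneck re-weights channels,
    loosely following polarized self-attention (assumed form, not AP-TransNet's)."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, 1, 1)               # spatial query -> one map
        self.v = nn.Conv2d(channels, channels // 2, 1)   # bottleneck values
        self.up = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # softmax over spatial positions boosts the dynamic range of the query
        q = F.softmax(self.q(x).view(b, 1, h * w), dim=-1)           # (B, 1, HW)
        v = self.v(x).view(b, c // 2, h * w)                         # (B, C/2, HW)
        ctx = torch.bmm(v, q.transpose(1, 2)).view(b, c // 2, 1, 1)  # pooled context
        gate = torch.sigmoid(self.up(ctx))                           # channel gate
        return x * gate

block = PolarizedChannelBranch(channels=32)
print(block(torch.randn(2, 32, 14, 14)).shape)    # torch.Size([2, 32, 14, 14])
```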
{"title":"AP-TransNet: a polarized transformer based aerial human action recognition framework","authors":"Chhavi Dhiman, Anunay Varshney, Ved Vyapak","doi":"10.1007/s00138-024-01535-1","DOIUrl":"https://doi.org/10.1007/s00138-024-01535-1","url":null,"abstract":"<p>Drones are widespread and actively employed in a variety of applications due to their low cost and quick mobility and enabling new forms of action surveillance. However, owing to various challenges- limited no. of aerial view samples, aerial footage suffers with camera motion, illumination changes, small actor size, occlusion, complex backgrounds, and varying view angles, human action recognition in aerial videos even more challenging. Maneuvering the same, we propose Aerial Polarized-Transformer Network (AP-TransNet) to recognize human actions in aerial view using both spatial and temporal details of the video feed. In this paper, we present the Polarized Encoding Block that performs (<span>({text{i}}))</span> Selection with Rejection to select the significant features and reject least informative features similar to Light photometry phenomena and (<span>({text{ii}}))</span> boosting operation increases the dynamic range of encodings using non-linear softmax normalization at the bottleneck tensors in both channel and spatial sequential branches. The performance of the proposed AP-TransNet is evaluated by conducting extensive experiments on three publicly available benchmark datasets: drone action dataset, UCF-ARG Dataset and Multi-View Outdoor Dataset (MOD20) supporting with ablation study. The proposed work outperformed the state-of-the-arts.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"52 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-09, DOI: 10.1007/s00138-024-01533-3
Georgios Petrakis, Panagiotis Partsinevelos
Semantic segmentation plays a significant role in unstructured and planetary scene understanding, offering a robotic system or planetary rover valuable knowledge about its surroundings. Several studies investigate rover-based scene recognition in planetary-like environments, but there is a lack of a semantic segmentation architecture focused on computing systems with low resources and tested on the lunar surface. In this study, a lightweight encoder-decoder neural network (NN) architecture is proposed for rover-based ground segmentation on the lunar surface. The proposed architecture is composed of a modified MobileNetV2 as encoder and a lightweight U-net decoder, while training and evaluation were conducted using a publicly available synthetic dataset with lunar landscape images. The proposed model provides robust segmentation results, allowing lunar scene understanding focused on rocks and boulders. It achieves accuracy similar to that of the original U-net and U-net-based architectures, which are 110–140 times larger than the proposed architecture. This study aims to contribute to lunar landscape segmentation using deep learning techniques and demonstrates great potential for autonomous lunar navigation, ensuring safer and smoother navigation on the Moon. To the best of our knowledge, this is the first study to propose a lightweight semantic segmentation architecture for the lunar surface, aiming to reinforce autonomous rover navigation.
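A rough PyTorch sketch of the encoder-decoder idea, pairing a MobileNetV2 feature extractor with a small upsampling head; the U-net skip connections of the paper's decoder are omitted here for brevity, and the channel sizes and class count are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class LightLunarSeg(nn.Module):
    """Sketch: MobileNetV2 features as encoder, a small upsampling decoder head.
    Skip connections and exact channel sizes of the paper's model are assumed away."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder = mobilenet_v2(weights=None).features   # 1280-ch output at 1/32 scale
        self.decoder = nn.Sequential(
            nn.Conv2d(1280, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = LightLunarSeg(num_classes=2)
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 2, 224, 224])
```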
{"title":"Lunar ground segmentation using a modified U-net neural network","authors":"Georgios Petrakis, Panagiotis Partsinevelos","doi":"10.1007/s00138-024-01533-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01533-3","url":null,"abstract":"<p>Semantic segmentation plays a significant role in unstructured and planetary scene understanding, offering to a robotic system or a planetary rover valuable knowledge about its surroundings. Several studies investigate rover-based scene recognition planetary-like environments but there is a lack of a semantic segmentation architecture, focused on computing systems with low resources and tested on the lunar surface. In this study, a lightweight encoder-decoder neural network (NN) architecture is proposed for rover-based ground segmentation on the lunar surface. The proposed architecture is composed by a modified MobilenetV2 as encoder and a lightweight U-net decoder while the training and evaluation process were conducted using a publicly available synthetic dataset with lunar landscape images. The proposed model provides robust segmentation results, allowing the lunar scene understanding focused on rocks and boulders. It achieves similar accuracy, compared with original U-net and U-net-based architectures which are 110–140 times larger than the proposed architecture. This study, aims to contribute in lunar landscape segmentation utilizing deep learning techniques, while it proves a great potential in autonomous lunar navigation ensuring a safer and smoother navigation on the moon. To the best of our knowledge, this is the first study which propose a lightweight semantic segmentation architecture for the lunar surface, aiming to reinforce the autonomous rover navigation.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"65 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-09, DOI: 10.1007/s00138-024-01529-z
Chenyu Ma, Jinfang Jia, Jianqiang Huang, Li Wu, Xiaoying Wang
Few-shot learning (FSL) aims to adapt quickly to new categories with limited samples. Despite significant progress in utilizing meta-learning for solving FSL tasks, challenges such as overfitting and poor generalization still exist. Building upon the demonstrated significance of powerful feature representation, this work proposes DisRot, a novel two-strategy training mechanism that combines knowledge distillation and a rotation prediction task for the pre-training phase of transfer learning. Knowledge distillation enables shallow networks to learn relational knowledge contained in deep networks, while the self-supervised rotation prediction task provides class-irrelevant and transferable knowledge for the supervised task. Simultaneous optimization of these two tasks allows the model to learn a generalizable and transferable feature embedding. Extensive experiments on the miniImageNet and FC100 datasets demonstrate that DisRot can effectively improve the generalization ability of the model and is comparable to the leading FSL methods.
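A minimal sketch of such a joint pre-training objective: supervised cross-entropy plus a temperature-scaled distillation term and a 4-way rotation-prediction loss; the weights alpha and beta and the temperature tau are assumed placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def disrot_style_loss(student_logits, teacher_logits, class_labels,
                      rot_logits, rot_labels, alpha=0.5, beta=1.0, tau=4.0):
    """Joint pre-training loss: supervised CE + KD from a teacher +
    self-supervised rotation prediction (weights and temperature assumed)."""
    ce = F.cross_entropy(student_logits, class_labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau                                    # standard temperature scaling
    rot = F.cross_entropy(rot_logits, rot_labels)    # 4-way: 0/90/180/270 degrees
    return ce + alpha * kd + beta * rot

# toy usage with random logits for a 64-class base set
s = torch.randn(8, 64)
t = torch.randn(8, 64)
y = torch.randint(0, 64, (8,))
r_logits = torch.randn(8, 4)
r_labels = torch.randint(0, 4, (8,))
print(disrot_style_loss(s, t, y, r_logits, r_labels))
```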
{"title":"DisRot: boosting the generalization capability of few-shot learning via knowledge distillation and self-supervised learning","authors":"Chenyu Ma, Jinfang Jia, Jianqiang Huang, Li Wu, Xiaoying Wang","doi":"10.1007/s00138-024-01529-z","DOIUrl":"https://doi.org/10.1007/s00138-024-01529-z","url":null,"abstract":"<p>Few-shot learning (FSL) aims to adapt quickly to new categories with limited samples. Despite significant progress in utilizing meta-learning for solving FSL tasks, challenges such as overfitting and poor generalization still exist. Building upon the demonstrated significance of powerful feature representation, this work proposes disRot, a novel two-strategy training mechanism, which combines knowledge distillation and rotation prediction task for the pre-training phase of transfer learning. Knowledge distillation enables shallow networks to learn relational knowledge contained in deep networks, while the self-supervised rotation prediction task provides class-irrelevant and transferable knowledge for the supervised task. Simultaneous optimization for these two tasks allows the model learn generalizable and transferable feature embedding. Extensive experiments on the miniImageNet and FC100 datasets demonstrate that disRot can effectively improve the generalization ability of the model and is comparable to the leading FSL methods.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"37 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The quality of annotations in datasets is crucial for supervised machine learning, as it significantly affects the performance of models. While many public datasets are widely used, they often suffer from annotation errors, including missing annotations and incorrect bounding box sizes and positions. This results in lower accuracy of machine learning models. However, most researchers have traditionally focused on improving model performance by enhancing algorithms, while overlooking concerns regarding data quality. This so-called model-centric AI approach has been predominant. In contrast, a data-centric AI approach, advocated by Andrew Ng at the DATA and AI Summit 2022, emphasizes enhancing data quality while keeping the model fixed, which proves to be more efficient in improving performance. Building upon this data-centric approach, we propose a method to enhance the quality of public datasets such as MS-COCO and the Open Images Dataset. Our approach involves automatically retrieving missing annotations and correcting the size and position of existing bounding boxes in these datasets. Specifically, our study deals with human object detection, which is one of the prominent applications of artificial intelligence. Experimental results demonstrate improved performance with models such as Faster-RCNN, EfficientDet, and RetinaNet. We achieve an improvement of up to 32% in mAP over the original datasets after applying both proposed methods to a dataset in which grouped instances are transformed into individual instances. In summary, our methods significantly enhance model performance by improving the quality of annotations at a lower cost and in less time than the manual improvement employed in other studies.
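A toy sketch of the general idea of refining annotations with a detector's outputs: ground-truth boxes that overlap a detection strongly are snapped to it, and detections matching no ground-truth box are added as recovered missing annotations; the IoU thresholds and the matching rule are assumptions, not the paper's procedure.

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two arrays of [x1, y1, x2, y2] boxes."""
    ious = np.zeros((len(boxes_a), len(boxes_b)))
    for i, a in enumerate(boxes_a):
        for j, b in enumerate(boxes_b):
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (b[2] - b[0]) * (b[3] - b[1]) - inter)
            ious[i, j] = inter / (union + 1e-9)
    return ious

def refine_annotations(gt_boxes, detector_boxes, add_thr=0.3, replace_thr=0.5):
    """Add detector boxes that match no ground-truth box (missing annotations)
    and snap loosely drawn ground-truth boxes to the best-overlapping detection."""
    gt_boxes = np.asarray(gt_boxes, float)
    detector_boxes = np.asarray(detector_boxes, float)
    ious = (iou_matrix(detector_boxes, gt_boxes)
            if len(gt_boxes) else np.zeros((len(detector_boxes), 0)))
    refined = []
    for j, g in enumerate(gt_boxes):
        best = ious[:, j].argmax() if len(detector_boxes) else -1
        keep_det = best >= 0 and ious[best, j] >= replace_thr
        refined.append(detector_boxes[best] if keep_det else g)   # corrected box
    for i, d in enumerate(detector_boxes):
        if len(gt_boxes) == 0 or ious[i].max() < add_thr:
            refined.append(d)                                     # recovered annotation
    return np.array(refined)

print(refine_annotations([[0, 0, 10, 10]], [[1, 1, 10, 10], [50, 50, 70, 80]]))
```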
{"title":"The improvement of ground truth annotation in public datasets for human detection","authors":"Sotheany Nou, Joong-Sun Lee, Nagaaki Ohyama, Takashi Obi","doi":"10.1007/s00138-024-01527-1","DOIUrl":"https://doi.org/10.1007/s00138-024-01527-1","url":null,"abstract":"<p>The quality of annotations in the datasets is crucial for supervised machine learning as it significantly affects the performance of models. While many public datasets are widely used, they often suffer from annotations errors, including missing annotations, incorrect bounding box sizes, and positions. It results in low accuracy of machine learning models. However, most researchers have traditionally focused on improving model performance by enhancing algorithms, while overlooking concerns regarding data quality. This so-called model-centric AI approach has been predominant. In contrast, a data-centric AI approach, advocated by Andrew Ng at the DATA and AI Summit 2022, emphasizes enhancing data quality while keeping the model fixed, which proves to be more efficient in improving performance. Building upon this data-centric approach, we propose a method to enhance the quality of public datasets such as MS-COCO and Open Image Dataset. Our approach involves automatically retrieving missing annotations and correcting the size and position of existing bounding boxes in these datasets. Specifically, our study deals with human object detection, which is one of the prominent applications of artificial intelligence. Experimental results demonstrate improved performance with models such as Faster-RCNN, EfficientDet, and RetinaNet. We can achieve up to 32% compared to original datasets in the term of mAP after applying both proposed methods to dataset which is transformed the grouped of instances to individual instance. In summary, our methods significantly enhance the model’s performance by improving the quality of annotations at a lower cost with less time than manual improvement employed in other studies.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"23 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article, we introduce an advanced approach for enhanced image denoising using an improved space-variant anisotropic Partial Differential Equation (PDE) framework. Leveraging Weickert-type operators, this method relies on two critical parameters, λ and θ, defining local image geometry and smoothing strength. We propose an automatic parameter estimation technique rooted in PDE-constrained optimization, incorporating supplementary information from the original clean image. By combining these components, our approach achieves superior image denoising, pushing the boundaries of image enhancement methods. We employed a modified Alternating Direction Method of Multipliers (ADMM) procedure for numerical optimization, demonstrating its efficacy through thorough assessments and affirming its superior performance compared to alternative denoising methods.
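For orientation, a generic Weickert-type space-variant diffusion model is written out below; the exact tensor used in the paper and the precise roles of λ and θ may differ, so this is only an assumed reference form.

```latex
% Generic structure-tensor-driven anisotropic diffusion (assumed reference form)
\partial_t u = \operatorname{div}\!\bigl(D\bigl(J_\rho(\nabla u_\sigma)\bigr)\,\nabla u\bigr),
\qquad
D = \theta_1\, v_1 v_1^{\top} + \theta_2\, v_2 v_2^{\top},
\qquad
\theta_1 = g_\lambda(\mu_1),\ \ \theta_2 \approx 1,
```

where v_1, v_2 and μ_1 ≥ μ_2 are the eigenvectors and eigenvalues of the smoothed structure tensor J_ρ, and g_λ is a decreasing diffusivity with contrast parameter λ, so smoothing is reduced across strong image structures and kept along them.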
{"title":"Tensor-guided learning for image denoising using anisotropic PDEs","authors":"Fakhr-eddine Limami, Aissam Hadri, Lekbir Afraites, Amine Laghrib","doi":"10.1007/s00138-024-01532-4","DOIUrl":"https://doi.org/10.1007/s00138-024-01532-4","url":null,"abstract":"<p>In this article, we introduce an advanced approach for enhanced image denoising using an improved space-variant anisotropic Partial Differential Equation (PDE) framework. Leveraging Weickert-type operators, this method relies on two critical parameters: <span>(lambda )</span> and <span>(theta )</span>, defining local image geometry and smoothing strength. We propose an automatic parameter estimation technique rooted in PDE-constrained optimization, incorporating supplementary information from the original clean image. By combining these components, our approach achieves superior image denoising, pushing the boundaries of image enhancement methods. We employed a modified Alternating Direction Method of Multipliers (ADMM) procedure for numerical optimization, demonstrating its efficacy through thorough assessments and affirming its superior performance compared to alternative denoising methods.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"44 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-07, DOI: 10.1007/s00138-024-01526-2
Wugen Zhou, Xiaodong Peng, Yun Li, Mingrui Fan, Bo Liu
The robustness of dense visual SLAM remains a challenging problem in dynamic environments. In this paper, we propose a novel keyframe-based dense visual SLAM method that handles highly dynamic environments using an RGB-D camera. The proposed method uses cluster-based residual models and semantic cues to detect dynamic objects, resulting in motion segmentation that outperforms traditional methods. The method also employs motion-segmentation-based keyframe selection strategies and a frame-to-keyframe matching scheme that reduce the influence of dynamic objects, thus minimizing trajectory errors. We further filter out the influence of dynamic objects based on motion segmentation and then employ true matches from keyframes near the current keyframe to facilitate loop closure. Finally, a pose graph is established and optimized using the g2o framework. Our experimental results demonstrate the success of our approach in handling highly dynamic sequences, as evidenced by more robust motion segmentation results and significantly lower trajectory drift compared to several state-of-the-art dense visual odometry and SLAM methods on challenging public benchmark datasets.
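A toy NumPy sketch of a cluster-level residual test for flagging dynamic objects: points are warped by the estimated camera motion and clusters whose mean residual stays high are marked dynamic; the threshold and the purely geometric residual are assumptions, since the paper also relies on photometric residuals and semantic cues.

```python
import numpy as np

def flag_dynamic_clusters(points_prev, points_cur, labels, R, t, thr=0.05):
    """Mark point clusters whose mean geometric residual under the estimated
    camera motion (R, t) exceeds a threshold as dynamic (toy criterion)."""
    residuals = np.linalg.norm(points_cur - (points_prev @ R.T + t), axis=1)
    dynamic = set()
    for c in np.unique(labels):
        if residuals[labels == c].mean() > thr:
            dynamic.add(int(c))
    return dynamic

rng = np.random.default_rng(1)
pts = rng.normal(size=(100, 3))
labels = np.repeat([0, 1], 50)
moved = pts.copy()
moved[50:] += 0.3                      # cluster 1 moves independently -> dynamic
print(flag_dynamic_clusters(pts, moved, labels, np.eye(3), np.zeros(3)))  # {1}
```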
{"title":"Keyframe-based RGB-D dense visual SLAM fused semantic cues in dynamic scenes","authors":"Wugen Zhou, Xiaodong Peng, Yun Li, Mingrui Fan, Bo Liu","doi":"10.1007/s00138-024-01526-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01526-2","url":null,"abstract":"<p>The robustness of dense visual SLAM is still a challenging problem in dynamic environments. In this paper, we propose a novel keyframe-based dense visual SLAM to handle a highly dynamic environment by using an RGB-D camera. The proposed method uses cluster-based residual models and semantic cues to detect dynamic objects, resulting in motion segmentation that outperforms traditional methods. The method also employs motion-segmentation based keyframe selection strategies and frame-to-keyframe matching scheme that reduce the influence of dynamic objects, thus minimizing trajectory errors. We further filter out dynamic object influence based on motion segmentation and then employ true matches from keyframes, which are near the current keyframe, to facilitate loop closure. Finally, a pose graph is established and optimized using the g2o framework. Our experimental results demonstrate the success of our approach in handling highly dynamic sequences, as evidenced by the more robust motion segmentation results and significantly lower trajectory drift compared to several state-of-the-art dense visual odometry or SLAM methods on challenging public benchmark datasets.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"53 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}