Pub Date: 2024-04-15 | DOI: 10.1007/s00138-024-01520-8
Jiazheng Wen, Huanyu Liu, Junbao Li
Multi-object tracking in dense scenes has long been a major difficulty in this field. Although some existing algorithms achieve excellent results in multi-object tracking, they generalize poorly when transferred to more challenging dense scenarios. In this work, we propose PTDS (Pedestrian Tracking in Dense Scenes) CenterTrack, built on CenterTrack's object center-point detection and tracking. It uses dense inter-frame similarity to compare object appearance features and predict inter-frame position changes, extending CenterTrack, which relies on motion features alone. We also propose a feature enhancement method based on a hybrid attention mechanism, which adds temporal information between frames to the features required for object detection and connects the detection and tracking tasks. On the MOT20 benchmark, PTDS CenterTrack achieves 55.6% MOTA, 55.1% IDF1, and 45.1% HOTA, improvements of 10.1, 4.0, and 4.8 percentage points, respectively, over CenterTrack.
Title: "PTDS CenterTrack: pedestrian tracking in dense scenes with re-identification and feature enhancement" (Machine Vision and Applications, 2024)
Pub Date: 2024-04-12 | DOI: 10.1007/s00138-024-01531-5
Vukasin D. Stanojevic, Branimir T. Todorovic
Handling unreliable detections and avoiding identity switches are crucial for the success of multiple object tracking (MOT). Ideally, an MOT algorithm should use only true positive detections, run in real time, and produce no identity switches. To approach this ideal, we present BoostTrack, a simple yet effective tracking-by-detection MOT method that uses several lightweight plug-and-play additions to improve MOT performance. We design a detection-tracklet confidence score and use it to scale the similarity measure, implicitly favouring pairs with high detection and tracklet confidence in one-stage association. To reduce the ambiguity arising from using intersection over union (IoU), we propose novel Mahalanobis distance and shape similarity terms that boost the overall similarity measure. To utilize low-score bounding boxes in one-stage association, we boost the confidence scores of two groups of detections: those we assume correspond to an existing tracked object, and those we assume correspond to a previously undetected object. The proposed additions are orthogonal to existing approaches, and combined with interpolation and camera motion compensation they achieve results comparable to standard benchmark solutions while retaining real-time execution speed. When combined with appearance similarity, our method outperforms all standard benchmark solutions on the MOT17 and MOT20 datasets, and it ranks first among online methods in the HOTA metric on the MOT Challenge MOT17 and MOT20 test sets. Our code is available at https://github.com/vukasin-stanojevic/BoostTrack.
Title: "BoostTrack: boosting the similarity measure and detection confidence for improved multiple object tracking" (Machine Vision and Applications, 2024)
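The confidence-scaled similarity described above can be sketched as follows. The exact functional forms (the shape term and the way the two confidences combine) are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def shape_similarity(a, b):
    # one plausible shape term: exponential penalty on width/height mismatch
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    return float(np.exp(-abs(wa - wb) / max(wa, wb) - abs(ha - hb) / max(ha, hb)))

def boosted_similarity(det_box, trk_box, det_conf, trk_conf):
    # the detection-tracklet confidence scales the boosted similarity,
    # implicitly favouring high-confidence pairs in one-stage association
    return det_conf * trk_conf * (iou(det_box, trk_box) + shape_similarity(det_box, trk_box))
```

The paper's Mahalanobis-distance term (gating candidate pairs by predicted tracklet motion) would enter the bracketed sum in the same additive way.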
Pub Date: 2024-04-10 | DOI: 10.1007/s00138-024-01535-1
Chhavi Dhiman, Anunay Varshney, Ved Vyapak
Drones are inexpensive and highly mobile, and are therefore widely employed in a variety of applications, enabling new forms of action surveillance. However, human action recognition in aerial videos is especially challenging: aerial-view samples are scarce, and aerial footage suffers from camera motion, illumination changes, small actor size, occlusion, complex backgrounds, and varying view angles. To address these challenges, we propose the Aerial Polarized-Transformer Network (AP-TransNet), which recognizes human actions in aerial views using both spatial and temporal details of the video feed. We present the Polarized Encoding Block, which performs (i) selection with rejection, keeping the most significant features and rejecting the least informative ones, analogous to the light photometry phenomenon, and (ii) a boosting operation that increases the dynamic range of the encodings using non-linear softmax normalization at the bottleneck tensors in both the channel and spatial sequential branches. The performance of AP-TransNet is evaluated in extensive experiments on three publicly available benchmark datasets: the Drone Action dataset, the UCF-ARG dataset, and the Multi-View Outdoor Dataset (MOD20), supported by an ablation study. The proposed work outperforms the state of the art.
Title: "AP-TransNet: a polarized transformer based aerial human action recognition framework" (Machine Vision and Applications, 2024)
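The softmax-based boosting operation can be illustrated on a channel branch. The per-channel weighting below is a hypothetical simplification of the polarized attention described above, not its exact form:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_boost(feat):
    # feat: (C, H, W) bottleneck tensor. Non-linear softmax normalization over
    # per-channel descriptors yields attention weights that re-scale (boost)
    # the encodings; multiplying by C leaves a uniform tensor unchanged.
    desc = feat.reshape(feat.shape[0], -1).mean(axis=1)  # one scalar per channel
    w = softmax(desc) * feat.shape[0]
    return feat * w[:, None, None]
```

Because softmax is non-linear, channels with stronger responses receive disproportionately larger weights, which is the "dynamic range increase" effect the abstract refers to.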
Pub Date: 2024-04-09 | DOI: 10.1007/s00138-024-01533-3
Georgios Petrakis, Panagiotis Partsinevelos
Semantic segmentation plays a significant role in unstructured and planetary scene understanding, offering a robotic system or planetary rover valuable knowledge about its surroundings. Several studies investigate rover-based scene recognition in planetary-like environments, but there is no semantic segmentation architecture focused on computing systems with low resources and tested on the lunar surface. In this study, a lightweight encoder-decoder neural network (NN) architecture is proposed for rover-based ground segmentation on the lunar surface. The proposed architecture is composed of a modified MobileNetV2 encoder and a lightweight U-net decoder, and it was trained and evaluated on a publicly available synthetic dataset of lunar landscape images. The proposed model provides robust segmentation results, enabling lunar scene understanding focused on rocks and boulders. It achieves accuracy similar to the original U-net and U-net-based architectures, which are 110–140 times larger than the proposed architecture. This study contributes to lunar landscape segmentation with deep learning techniques and demonstrates great potential for autonomous lunar navigation, ensuring safer and smoother navigation on the moon. To the best of our knowledge, this is the first study to propose a lightweight semantic segmentation architecture for the lunar surface aimed at reinforcing autonomous rover navigation.
Title: "Lunar ground segmentation using a modified U-net neural network" (Machine Vision and Applications, 2024)
Pub Date: 2024-04-09 | DOI: 10.1007/s00138-024-01529-z
Chenyu Ma, Jinfang Jia, Jianqiang Huang, Li Wu, Xiaoying Wang
Few-shot learning (FSL) aims to adapt quickly to new categories with limited samples. Despite significant progress in utilizing meta-learning for solving FSL tasks, challenges such as overfitting and poor generalization still exist. Building upon the demonstrated significance of powerful feature representation, this work proposes disRot, a novel two-strategy training mechanism that combines knowledge distillation and a rotation prediction task for the pre-training phase of transfer learning. Knowledge distillation enables shallow networks to learn the relational knowledge contained in deep networks, while the self-supervised rotation prediction task provides class-irrelevant, transferable knowledge for the supervised task. Simultaneous optimization of these two tasks allows the model to learn a generalizable and transferable feature embedding. Extensive experiments on the miniImageNet and FC100 datasets demonstrate that disRot effectively improves the generalization ability of the model and is comparable to the leading FSL methods.
Title: "DisRot: boosting the generalization capability of few-shot learning via knowledge distillation and self-supervised learning" (Machine Vision and Applications, 2024)
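The two-task pre-training objective can be sketched as a weighted sum of a distillation term and a rotation-prediction term. The temperature, the weight alpha, and the exact loss forms below are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    # soft-target distillation: KL(teacher || student) at temperature T,
    # scaled by T^2 as is conventional
    p, q = softmax(teacher_logits / T), softmax(student_logits / T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

def rotation_loss(rot_logits, rot_labels):
    # self-supervised 4-way task: predict the rotation in {0, 90, 180, 270} deg
    q = softmax(rot_logits)
    return float(-np.log(q[np.arange(len(rot_labels)), rot_labels]).mean())

def pretrain_loss(student_logits, teacher_logits, rot_logits, rot_labels, alpha=1.0):
    # joint objective optimized simultaneously during the pre-training phase
    return kd_loss(student_logits, teacher_logits) + alpha * rotation_loss(rot_logits, rot_labels)
```

Optimizing both terms in one backward pass is what lets the embedding absorb the teacher's relational knowledge and the rotation task's class-irrelevant cues at once.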
Pub Date: 2024-04-08 | DOI: 10.1007/s00138-024-01527-1
Sotheany Nou, Joong-Sun Lee, Nagaaki Ohyama, Takashi Obi
The quality of annotations in datasets is crucial for supervised machine learning, as it significantly affects model performance. While many public datasets are widely used, they often suffer from annotation errors, including missing annotations and incorrect bounding box sizes and positions, which lower the accuracy of machine learning models. Most researchers have traditionally focused on improving model performance by enhancing algorithms while overlooking data quality; this so-called model-centric AI approach has been predominant. In contrast, a data-centric AI approach, advocated by Andrew Ng at the DATA and AI Summit 2022, emphasizes enhancing data quality while keeping the model fixed, which proves more efficient in improving performance. Building upon this data-centric approach, we propose a method to enhance the quality of public datasets such as MS-COCO and the Open Images Dataset. Our approach automatically retrieves missing annotations and corrects the size and position of existing bounding boxes in these datasets. Specifically, our study deals with human object detection, one of the prominent applications of artificial intelligence. Experimental results demonstrate improved performance with models such as Faster R-CNN, EfficientDet, and RetinaNet: we achieve up to a 32% improvement in mAP over the original datasets after applying both proposed methods to a dataset in which grouped instances are transformed into individual instances. In summary, our methods significantly enhance model performance by improving annotation quality at lower cost and in less time than the manual improvement employed in other studies.
Title: "The improvement of ground truth annotation in public datasets for human detection" (Machine Vision and Applications, 2024)
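The missing-annotation retrieval step can be sketched with a simple IoU rule: a high-confidence model detection that overlaps no existing ground-truth box becomes a candidate annotation. The thresholds below are assumptions; the paper's actual procedure may differ:

```python
def iou(a, b):
    # boxes as [x1, y1, x2, y2]
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def find_missing_annotations(detections, ground_truth, iou_thr=0.5, conf_thr=0.8):
    # detections: list of (box, confidence) pairs from a trained detector;
    # ground_truth: list of annotated boxes. A confident detection that
    # matches no ground-truth box is a candidate missing annotation.
    missing = []
    for box, conf in detections:
        if conf >= conf_thr and all(iou(box, gt) < iou_thr for gt in ground_truth):
            missing.append(box)
    return missing
```

The same IoU matching, run in reverse (ground-truth box against best-overlapping detection), could drive the box size/position correction step.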
Pub Date: 2024-04-08 | DOI: 10.1007/s00138-024-01532-4
Fakhr-eddine Limami, Aissam Hadri, Lekbir Afraites, Amine Laghrib
In this article, we introduce an advanced approach to image denoising using an improved space-variant anisotropic partial differential equation (PDE) framework. Leveraging Weickert-type operators, the method relies on two critical parameters, λ and θ, which define the local image geometry and the smoothing strength. We propose an automatic parameter estimation technique rooted in PDE-constrained optimization, incorporating supplementary information from the original clean image. By combining these components, our approach achieves superior image denoising, pushing the boundaries of image enhancement methods. We employ a modified Alternating Direction Method of Multipliers (ADMM) procedure for numerical optimization, demonstrating its efficacy through thorough assessments and affirming its superior performance compared to alternative denoising methods.
Title: "Tensor-guided learning for image denoising using anisotropic PDEs" (Machine Vision and Applications, 2024)
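For intuition, one explicit step of space-variant nonlinear diffusion with a data-fidelity term can be sketched as below. This uses a scalar Perona–Malik-type conductivity as a stand-in for the Weickert tensor operator, and the roles assigned to λ (fidelity weight) and θ (edge-contrast scale) are assumptions, not the paper's definitions:

```python
import numpy as np

def diffusion_step(u, f, lam=0.1, theta=0.15, dt=0.2):
    # One explicit step of  u <- u + dt * ( div(g(|grad u|) grad u) + lam*(f - u) ),
    # where f is the noisy input, g is a Perona-Malik-type conductivity
    # (small near edges, so edges are smoothed less), and lam weights fidelity.
    gx = np.gradient(u, axis=1)
    gy = np.gradient(u, axis=0)
    g = 1.0 / (1.0 + (np.hypot(gx, gy) / theta) ** 2)  # space-variant conductivity
    div = np.gradient(g * gx, axis=1) + np.gradient(g * gy, axis=0)
    return u + dt * (div + lam * (f - u))
```

The paper's contribution is to estimate λ and θ automatically via PDE-constrained optimization rather than hand-tuning them as done here.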
Pub Date: 2024-04-07 | DOI: 10.1007/s00138-024-01526-2
Wugen Zhou, Xiaodong Peng, Yun Li, Mingrui Fan, Bo Liu
The robustness of dense visual SLAM remains a challenging problem in dynamic environments. In this paper, we propose a novel keyframe-based dense visual SLAM method that handles highly dynamic environments using an RGB-D camera. The proposed method uses cluster-based residual models and semantic cues to detect dynamic objects, yielding motion segmentation that outperforms traditional methods. It also employs motion-segmentation-based keyframe selection strategies and a frame-to-keyframe matching scheme that reduce the influence of dynamic objects, minimizing trajectory errors. We further filter out the influence of dynamic objects based on motion segmentation, then use true matches from keyframes near the current keyframe to facilitate loop closure. Finally, a pose graph is established and optimized with the g2o framework. Our experimental results demonstrate the success of our approach on highly dynamic sequences, as evidenced by more robust motion segmentation results and significantly lower trajectory drift compared to several state-of-the-art dense visual odometry and SLAM methods on challenging public benchmark datasets.
Title: "Keyframe-based RGB-D dense visual SLAM fused semantic cues in dynamic scenes" (Machine Vision and Applications, 2024)
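The cluster-based residual test can be sketched as follows. The decision rule (mean cluster residual versus a multiple of the median cluster residual) is an illustrative assumption rather than the paper's exact criterion:

```python
import numpy as np

def segment_dynamic_clusters(residuals, labels, thr=2.0):
    # residuals: per-point alignment residuals after camera motion estimation;
    # labels: cluster id of each residual (e.g. from geometric clustering).
    # A cluster is flagged dynamic when its mean residual clearly exceeds
    # the typical (median) cluster residual - static geometry aligns well,
    # moving objects do not.
    ids = np.unique(labels)
    means = np.array([residuals[labels == i].mean() for i in ids])
    ref = np.median(means)
    return {int(i): bool(m > thr * ref) for i, m in zip(ids, means)}
```

In the full system, semantic cues (e.g. a person detector) would veto or confirm these per-cluster decisions before keyframe selection and matching.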
Pub Date: 2024-04-06 | DOI: 10.1007/s00138-024-01530-6
Daniel Rodriguez-Criado, Pilar Bachiller-Burgos, George Vogiatzis, Luis J. Manso
Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, it presents several challenges, especially when approached using multiple views from regular RGB cameras as the only input. First, each person must be uniquely identified across the different views. Second, the method must be robust to noise, partial occlusions, and views in which a person may not be detected. Third, many pose estimation approaches rely on environment-specific annotated datasets that are frequently prohibitively expensive and/or require specialised hardware. In this work, we address these three challenges with the help of self-supervised learning; specifically, ours is the first multi-camera, multi-person data-driven approach that does not require an annotated dataset. We present a three-stage pipeline and a rigorous evaluation providing evidence that our approach runs faster than other state-of-the-art algorithms, with comparable accuracy, and, most importantly, does not require annotated datasets. The pipeline is composed of a 2D skeleton detection step, followed by a Graph Neural Network that estimates cross-view correspondences of the people in the scene, and a Multi-Layer Perceptron that transforms the 2D information into 3D pose estimates. Our proposal comprises the last two steps and is compatible with any 2D skeleton detector as input. These two models are trained in a self-supervised manner, avoiding the need for datasets annotated with 3D ground-truth poses.
{"title":"Multi-person 3D pose estimation from unlabelled data","authors":"Daniel Rodriguez-Criado, Pilar Bachiller-Burgos, George Vogiatzis, Luis J. Manso","doi":"10.1007/s00138-024-01530-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01530-6","url":null,"abstract":"<p>Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, it presents several challenges, especially when approached using multiple views and regular RGB cameras as the only input. First, each person must be uniquely identified in the different views. Secondly, the method must be robust to noise, partial occlusions, and views where a person may not be detected. Thirdly, many pose estimation approaches rely on environment-specific annotated datasets that are frequently prohibitively expensive and/or require specialised hardware. In this work, we address these three challenges with the help of self-supervised learning; specifically, this is the first multi-camera, multi-person data-driven approach that does not require an annotated dataset. We present a three-stage pipeline and a rigorous evaluation providing evidence that our approach runs faster than other state-of-the-art algorithms, with comparable accuracy, and, most importantly, does not require annotated datasets. The pipeline is composed of a 2D skeleton detection step, followed by a Graph Neural Network that estimates cross-view correspondences of the people in the scene, and a Multi-Layer Perceptron that transforms the 2D information into 3D pose estimates. Our proposal comprises the last two steps, and it is compatible with any 2D skeleton detector as input. These two models are trained in a self-supervised manner, thus avoiding the need for datasets annotated with 3D ground-truth poses.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":null,"pages":null},"PeriodicalIF":3.3,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
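The final stage of the pipeline described above lifts matched multi-view 2D keypoints to a 3D pose with a Multi-Layer Perceptron. The following is a minimal sketch of that idea only; all names, layer sizes, and the random weights are illustrative assumptions, not the authors' implementation (in the paper the MLP is trained self-supervised, which is not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)

N_VIEWS, N_JOINTS = 4, 17          # assumed camera and skeleton-joint counts
IN_DIM = N_VIEWS * N_JOINTS * 2    # concatenated 2D keypoints from all views
OUT_DIM = N_JOINTS * 3             # one 3D position per joint

def mlp_lift(kp2d, w1, b1, w2, b2):
    """Toy MLP mapping grouped multi-view 2D keypoints to a 3D pose.

    kp2d: (N_VIEWS, N_JOINTS, 2) detections already matched to one person
    (the role played by the GNN cross-view correspondence stage).
    """
    x = kp2d.reshape(-1)                       # flatten to a single vector
    h = np.maximum(0.0, x @ w1 + b1)           # ReLU hidden layer
    return (h @ w2 + b2).reshape(N_JOINTS, 3)  # predicted 3D joints

# Randomly initialised weights stand in for the self-supervised training.
w1 = rng.normal(0.0, 0.01, (IN_DIM, 64)); b1 = np.zeros(64)
w2 = rng.normal(0.0, 0.01, (64, OUT_DIM)); b2 = np.zeros(OUT_DIM)

kp2d = rng.uniform(0.0, 1.0, (N_VIEWS, N_JOINTS, 2))
pose3d = mlp_lift(kp2d, w1, b1, w2, b2)
print(pose3d.shape)  # (17, 3)
```

Because the lifting network consumes only grouped 2D keypoints, any 2D skeleton detector can feed it, which is why the proposal is detector-agnostic.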
Pub Date : 2024-04-01 DOI: 10.1007/s00138-024-01528-0
Yuan Ding, Kaijun Wu
In sand-dust weather, the influence of sand-dust particles on imaging equipment often results in images with color deviation, blurring, and low contrast, among other issues. These problems prevent many traditional image restoration methods from accurately estimating the semantic information of the images, resulting in poor restoration of clear images. Most current image restoration methods in the field of deep learning are based on supervised learning, which requires pairing and labeling a large amount of data and carries the possibility of manual annotation errors. In light of this, we propose an unsupervised sand-dust image restoration network. The overall model adopts an improved CycleGAN to fit unpaired sand-dust images. Firstly, multiscale skip connections in the multiscale cascaded attention module are used to enhance the feature-fusion effect after downsampling. Secondly, multi-head convolutional attention with multiple input concatenations is employed, with each head using a different kernel size to improve the ability to restore detail information. Finally, the adaptive decoder-encoder module is used to achieve adaptive fitting of the model and output the restored image. In the experiments conducted on the dataset, the qualitative and quantitative results of USIR-Net are superior to those of the selected comparison algorithms. Furthermore, in additional experiments on haze removal and underwater image enhancement, we demonstrate the wide applicability of our model.
{"title":"USIR-Net: sand-dust image restoration based on unsupervised learning","authors":"Yuan Ding, Kaijun Wu","doi":"10.1007/s00138-024-01528-0","DOIUrl":"https://doi.org/10.1007/s00138-024-01528-0","url":null,"abstract":"<p>In sand-dust weather, the influence of sand-dust particles on imaging equipment often results in images with color deviation, blurring, and low contrast, among other issues. These problems prevent many traditional image restoration methods from accurately estimating the semantic information of the images, resulting in poor restoration of clear images. Most current image restoration methods in the field of deep learning are based on supervised learning, which requires pairing and labeling a large amount of data and carries the possibility of manual annotation errors. In light of this, we propose an unsupervised sand-dust image restoration network. The overall model adopts an improved CycleGAN to fit unpaired sand-dust images. Firstly, multiscale skip connections in the multiscale cascaded attention module are used to enhance the feature-fusion effect after downsampling. Secondly, multi-head convolutional attention with multiple input concatenations is employed, with each head using a different kernel size to improve the ability to restore detail information. Finally, the adaptive decoder-encoder module is used to achieve adaptive fitting of the model and output the restored image. In the experiments conducted on the dataset, the qualitative and quantitative results of USIR-Net are superior to those of the selected comparison algorithms. Furthermore, in additional experiments on haze removal and underwater image enhancement, we demonstrate the wide applicability of our model.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":null,"pages":null},"PeriodicalIF":3.3,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
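The abstract above describes attention heads that differ only in kernel size. The sketch below illustrates that one idea in plain numpy under stated assumptions: each "head" is a mean filter of a different size whose sigmoid response gates the input. The function names, the choice of mean filters, and the averaging of the heads are illustrative placeholders, not the USIR-Net architecture.

```python
import numpy as np

def conv2d_same(img, k):
    """Single-channel 2D convolution with edge padding ('same' output size)."""
    p = k.shape[0] // 2
    padded = np.pad(img, p, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def multi_kernel_attention(img, kernel_sizes=(3, 5, 7)):
    """Each head filters the image at a different kernel size and produces a
    sigmoid attention map; the averaged map then gates the input."""
    maps = []
    for ks in kernel_sizes:
        k = np.ones((ks, ks)) / (ks * ks)          # mean filter as a stand-in head
        maps.append(1.0 / (1.0 + np.exp(-conv2d_same(img, k))))
    attn = np.mean(maps, axis=0)                   # fuse the heads
    return img * attn                              # gated (attended) features

rng = np.random.default_rng(1)
img = rng.uniform(0.0, 1.0, (16, 16))
out = multi_kernel_attention(img)
print(out.shape)  # (16, 16)
```

Varying the kernel size per head is what lets small kernels respond to fine detail while large kernels capture broader context, which is the stated motivation for the multi-head design.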