Multi-person 3D pose estimation from unlabelled data
Pub Date : 2024-04-06 DOI: 10.1007/s00138-024-01530-6
Daniel Rodriguez-Criado, Pilar Bachiller-Burgos, George Vogiatzis, Luis J. Manso
Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, it presents several challenges, especially when approached using multiple views and regular RGB cameras as the only input. First, each person must be uniquely identified across the different views. Second, the method must be robust to noise, partial occlusions, and views where a person may not be detected. Third, many pose estimation approaches rely on environment-specific annotated datasets that are frequently prohibitively expensive and/or require specialised hardware. In this work, we address these three challenges with the help of self-supervised learning. Specifically, this is the first multi-camera, multi-person data-driven approach that does not require an annotated dataset. We present a three-stage pipeline and a rigorous evaluation providing evidence that our approach performs faster than other state-of-the-art algorithms, with comparable accuracy, and, most importantly, does not require annotated datasets. The pipeline is composed of a 2D skeleton detection step, followed by a Graph Neural Network that estimates cross-view correspondences of the people in the scene, and a Multi-Layer Perceptron that transforms the 2D information into 3D pose estimations. Our proposal comprises the last two steps and is compatible with any 2D skeleton detector as input. These two models are trained in a self-supervised manner, thus avoiding the need for datasets annotated with 3D ground-truth poses.
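As a rough illustration of how such a pipeline could be wired together, the sketch below chains generic 2D detector outputs into a graph-style cross-view matcher and an MLP lifter. It is a minimal sketch under stated assumptions, not the authors' implementation; the class names, tensor shapes, and the simple pairing of matched 2D detections are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class CrossViewMatcher(nn.Module):
    """Toy stand-in for the paper's GNN: scores whether two 2D skeletons
    (seen from different cameras) belong to the same person."""
    def __init__(self, joints=17, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(joints * 2, hidden), nn.ReLU())
        self.edge_scorer = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                         nn.Linear(hidden, 1))

    def forward(self, skel_a, skel_b):           # (N, J, 2), (M, J, 2)
        fa = self.encoder(skel_a.flatten(1))      # (N, H)
        fb = self.encoder(skel_b.flatten(1))      # (M, H)
        pairs = torch.cat([fa.unsqueeze(1).expand(-1, fb.size(0), -1),
                           fb.unsqueeze(0).expand(fa.size(0), -1, -1)], dim=-1)
        return torch.sigmoid(self.edge_scorer(pairs)).squeeze(-1)   # (N, M) match probs

class LiftingMLP(nn.Module):
    """Toy stand-in for the MLP that maps matched multi-view 2D joints to 3D."""
    def __init__(self, views=4, joints=17, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(views * joints * 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, joints * 3))

    def forward(self, joints_2d):                 # (B, V, J, 2)
        b = joints_2d.size(0)
        return self.net(joints_2d.flatten(1)).view(b, -1, 3)

# Usage: match skeletons from two views, then lift one person's 2D joints to 3D.
matcher, lifter = CrossViewMatcher(), LiftingMLP(views=2)
view_a, view_b = torch.rand(3, 17, 2), torch.rand(2, 17, 2)   # 3 and 2 detections
match_probs = matcher(view_a, view_b)                          # (3, 2)
person = torch.stack([view_a[0], view_b[match_probs[0].argmax()]]).unsqueeze(0)
pose_3d = lifter(person)                                       # (1, 17, 3)
```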
{"title":"Multi-person 3D pose estimation from unlabelled data","authors":"Daniel Rodriguez-Criado, Pilar Bachiller-Burgos, George Vogiatzis, Luis J. Manso","doi":"10.1007/s00138-024-01530-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01530-6","url":null,"abstract":"<p>Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, it presents several challenges, especially when approached using multiple views and regular RGB cameras as the only input. First, each person must be uniquely identified in the different views. Secondly, it must be robust to noise, partial occlusions, and views where a person may not be detected. Thirdly, many pose estimation approaches rely on environment-specific annotated datasets that are frequently prohibitively expensive and/or require specialised hardware. Specifically, this is the first multi-camera, multi-person data-driven approach that does not require an annotated dataset. In this work, we address these three challenges with the help of self-supervised learning. In particular, we present a three-staged pipeline and a rigorous evaluation providing evidence that our approach performs faster than other state-of-the-art algorithms, with comparable accuracy, and most importantly, does not require annotated datasets. The pipeline is composed of a 2D skeleton detection step, followed by a Graph Neural Network to estimate cross-view correspondences of the people in the scenario, and a Multi-Layer Perceptron that transforms the 2D information into 3D pose estimations. Our proposal comprises the last two steps, and it is compatible with any 2D skeleton detector as input. These two models are trained in a self-supervised manner, thus avoiding the need for datasets annotated with 3D ground-truth poses.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"40 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
USIR-Net: sand-dust image restoration based on unsupervised learning
Pub Date : 2024-04-01 DOI: 10.1007/s00138-024-01528-0
Yuan Ding, Kaijun Wu
In sand-dust weather, the influence of sand-dust particles on imaging equipment often results in images with color deviation, blurring, and low contrast, among other issues. These problems make many traditional image restoration methods unable to accurately estimate the semantic information of the images, resulting in poor restoration of clear images. Most current deep learning-based image restoration methods rely on supervised learning, which requires pairing and labeling a large amount of data and introduces the possibility of manual annotation errors. In light of this, we propose an unsupervised sand-dust image restoration network. The overall model adopts an improved CycleGAN to fit unpaired sand-dust images. Firstly, multiscale skip connections in the multiscale cascaded attention module are used to enhance the feature fusion effect after downsampling. Secondly, multi-head convolutional attention with multiple input concatenations is employed, with each head using a different kernel size to improve the ability to restore detail information. Finally, the adaptive decoder-encoder module is used to achieve adaptive fitting of the model and output the restored image. In experiments conducted on the dataset, the qualitative and quantitative indicators of USIR-Net are superior to those of the selected comparison algorithms. Furthermore, in additional experiments on haze removal and underwater image enhancement, we demonstrate the wide applicability of our model.
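The multi-head convolutional attention idea (each head with its own kernel size) can be sketched roughly as below. This is a hedged illustration, not the USIR-Net code; the head count, kernel sizes, sigmoid gating, and residual connection are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class MultiKernelConvAttention(nn.Module):
    """Each head computes a spatial attention map with a different kernel size,
    then the gated head outputs are concatenated and fused back to the input width."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        gated = [x * torch.sigmoid(head(x)) for head in self.heads]
        return self.fuse(torch.cat(gated, dim=1)) + x       # residual connection

feats = torch.rand(1, 64, 32, 32)
print(MultiKernelConvAttention(64)(feats).shape)            # torch.Size([1, 64, 32, 32])
```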
{"title":"USIR-Net: sand-dust image restoration based on unsupervised learning","authors":"Yuan Ding, Kaijun Wu","doi":"10.1007/s00138-024-01528-0","DOIUrl":"https://doi.org/10.1007/s00138-024-01528-0","url":null,"abstract":"<p>In sand-dust weather, the influence of sand-dust particles on imaging equipment often results in images with color deviation, blurring, and low contrast, among other issues. These problems making many traditional image restoration methods unable to accurately estimate the semantic information of the images and consequently resulting in poor restoration of clear images. Most current image restoration methods in the field of deep learning are based on supervised learning, which requires pairing and labeling a large amount of data, and the possibility of manual annotation errors. In light of this, we propose an unsupervised sand-dust image restoration network. The overall model adopts an improved CycleGAN to fit unpaired sand-dust images. Firstly, multiscale skip connections in the multiscale cascaded attention module are used to enhance the feature fusion effect after downsampling. Secondly, multi-head convolutional attention with multiple input concatenations is employed, with each head using different kernel sizes to improve the ability to restore detail information. Finally, the adaptive decoder-encoder module is used to achieve adaptive fitting of the model and output the restored image. According to the experiments conducted on the dataset, the qualitative and quantitative indicators of USIR-Net are superior to the selected comparison algorithms, furthermore, in additional experiments conducted on haze removal and underwater image enhancement, we have demonstrated the wide applicability of our model.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"94 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ssman: self-supervised masked adaptive network for 3D human pose estimation
Pub Date : 2024-03-27 DOI: 10.1007/s00138-024-01514-6
Modern deep learning-based models for 3D human pose estimation from monocular images often lack the ability to adapt between occlusion and non-occlusion scenarios, which might restrict the performance of current methods when faced with various scales of occlusion. In an attempt to tackle this problem, we propose a novel network called the self-supervised masked adaptive network (SSMAN). Firstly, we leverage different levels of masks to cover the richness of occlusion in fully in-the-wild environments. Then, we design a multi-line adaptive network, which can be trained with various scales of masked images in parallel. We train this masked adaptive network with self-supervised learning to enforce consistency across the outputs under different mask ratios. Furthermore, a global refinement module is proposed to leverage global features of the human body to refine the human pose estimated solely from local features. We perform extensive experiments both on occlusion datasets such as 3DPW-OCC and OCHuman and on general datasets such as Human3.6M and 3DPW. The results show that SSMAN achieves new state-of-the-art performance on both lightly and heavily occluded benchmarks and is highly competitive on standard benchmarks, with significant improvements.
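A minimal sketch of the cross-mask-ratio consistency idea is given below, assuming a generic pose network and random square masks; the mask ratios, the loss choice (L2 between predictions), and the `PoseNet` placeholder are assumptions, not the SSMAN implementation.

```python
import torch
import torch.nn as nn

def random_mask(images, ratio):
    """Zero out a random square patch covering roughly `ratio` of each image."""
    b, _, h, w = images.shape
    side = int((ratio * h * w) ** 0.5)
    out = images.clone()
    for i in range(b):
        y = torch.randint(0, h - side + 1, (1,)).item()
        x = torch.randint(0, w - side + 1, (1,)).item()
        out[i, :, y:y + side, x:x + side] = 0.0
    return out

class PoseNet(nn.Module):                       # placeholder backbone + pose head
    def __init__(self, joints=17):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(16, joints * 3))
    def forward(self, x):
        return self.body(x).view(x.size(0), -1, 3)

model = PoseNet()
images = torch.rand(4, 3, 256, 256)
ratios = (0.0, 0.2, 0.4)                        # increasing occlusion levels
preds = [model(random_mask(images, r)) for r in ratios]
# Consistency: predictions under heavier masking should agree with the unmasked ones.
consistency_loss = sum(nn.functional.mse_loss(p, preds[0].detach()) for p in preds[1:])
consistency_loss.backward()
```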
{"title":"Ssman: self-supervised masked adaptive network for 3D human pose estimation","authors":"","doi":"10.1007/s00138-024-01514-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01514-6","url":null,"abstract":"<h3>Abstract</h3> <p>The modern deep learning-based models for 3D human pose estimation from monocular images always lack the adaption ability between occlusion and non-occlusion scenarios, which might restrict the performance of current methods when faced with various scales of occluded conditions. In an attempt to tackle this problem, we propose a novel network called self-supervised masked adaptive network (SSMAN). Firstly, we leverage different levels of masks to cover the richness of occlusion in fully in-the-wild environment. Then, we design a multi-line adaptive network, which could be trained with various scales of masked images in parallel. Based on this masked adaptive network, we train it with self-supervised learning to enforce the consistency across the outputs under different mask ratios. Furthermore, a global refinement module is proposed to leverage global features of the human body to refine the human pose estimated solely by local features. We perform extensive experiments both on the occlusion datasets like 3DPW-OCC and OCHuman and general datasets such as Human3.6M and 3DPW. The results show that SSMAN achieves new state-of-the-art performance on both lightly and heavily occluded benchmarks and is highly competitive with significant improvement on standard benchmarks.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"6 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140316482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kernel based local matching network for video object segmentation
Pub Date : 2024-03-25 DOI: 10.1007/s00138-024-01524-4
Guoqiang Wang, Lan Li, Min Zhu, Rui Zhao, Xiang Zhang
Recently, methods based on space-time memory networks have achieved advanced performance in semi-supervised video object segmentation, which has attracted wide attention. However, this kind of method still has a fatal limitation: the interference between similar objects caused by non-local matching, which seriously limits the performance of video object segmentation. To solve this problem, we propose a Kernel-guided Attention Matching Network (KAMNet) that uses local matching instead of non-local matching. First, KAMNet uses a spatio-temporal attention mechanism to enhance the model's discrimination between foreground objects and background areas. Then KAMNet utilizes a Gaussian kernel to guide the matching between the current frame and the reference set. Because the Gaussian kernel decays away from the center, it limits the matching to the central region, thus achieving local matching. Our KAMNet achieves a speed-accuracy trade-off on the benchmark datasets DAVIS 2016 (\(\mathcal {J\&F}\) of 87.6%) and DAVIS 2017 (\(\mathcal {J\&F}\) of 76.0%) at 0.12 seconds per frame.
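The Gaussian-kernel-guided local matching can be pictured as weighting a non-local affinity map by a kernel centred on each query location, as in the hedged sketch below; the feature shapes, the similarity choice (dot product), and the kernel bandwidth are assumptions rather than KAMNet's exact formulation.

```python
import torch

def gaussian_local_matching(query, reference, sigma=4.0):
    """query, reference: (C, H, W) feature maps from the current and reference frames.
    Returns an (H*W, H*W) matching weight matrix in which each query location mostly
    attends to reference locations near the same spatial position (local matching)."""
    c, h, w = query.shape
    q = query.reshape(c, -1).t()                     # (HW, C)
    r = reference.reshape(c, -1)                     # (C, HW)
    affinity = q @ r                                 # non-local similarity (HW, HW)

    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()   # (HW, 2)
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)    # (HW, HW)
    kernel = torch.exp(-dist2 / (2 * sigma ** 2))    # decays away from the centre

    return torch.softmax(affinity * kernel, dim=1)   # kernel-weighted matching

weights = gaussian_local_matching(torch.rand(64, 30, 30), torch.rand(64, 30, 30))
print(weights.shape)                                 # torch.Size([900, 900])
```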
{"title":"Kernel based local matching network for video object segmentation","authors":"Guoqiang Wang, Lan Li, Min Zhu, Rui Zhao, Xiang Zhang","doi":"10.1007/s00138-024-01524-4","DOIUrl":"https://doi.org/10.1007/s00138-024-01524-4","url":null,"abstract":"<p>Recently, the methods based on space-time memory network have achieved advanced performance in semi-supervised video object segmentation, which has attracted wide attention. However, this kind of methods still have a fatal limitation. It has the interference problem of similar objects caused by the way of non-local matching, which seriously limits the performance of video object segmentation. To solve this problem, we propose a Kernel-guided Attention Matching Network (KAMNet) by the use of local matching instead of non-local matching. At first, KAMNet uses spatio-temporal attention mechanism to enhance the model’s discrimination between foreground objects and background areas. Then KAMNet utilizes gaussian kernel to guide the matching between the current frame and the reference set. Because the gaussian kernel decays away from the center, it can limit the matching to the central region, thus achieving local matching. Our KAMNet gets speed-accuracy trade-off on benchmark datasets DAVIS 2016 (<span>( mathcal {J & F})</span> of 87.6%) and DAVIS 2017 (<span>( mathcal {J & F})</span> of 76.0%) with 0.12 second per frame.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"46 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140300673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark
Pub Date : 2024-03-23 DOI: 10.1007/s00138-024-01510-w
David Bojanić, Kristijan Bartol, Josep Forest, Tomislav Petković, Tomislav Pribanić
Recent 3D registration methods are mostly learning-based: they either find correspondences in feature space and match them, or directly estimate the registration transformation from the given point cloud features. Therefore, these feature-based methods have difficulties generalizing to point clouds that differ substantially from their training data. This issue is not so apparent because of problematic benchmark definitions that cannot provide any in-depth analysis and contain a bias toward similar data. Therefore, we propose a methodology to create a 3D registration benchmark, given a point cloud dataset, that provides a more informative evaluation of a method w.r.t. other benchmarks. Using this methodology, we create a novel FAUST-partial (FP) benchmark, based on the FAUST dataset, with several difficulty levels. The FP benchmark addresses the limitations of the current benchmarks, namely the lack of data and parameter range variability, and allows evaluating the strengths and weaknesses of a 3D registration method w.r.t. a single registration parameter. Using the new FP benchmark, we provide a thorough analysis of the current state-of-the-art methods and observe that they still struggle to generalize to severely different out-of-sample data. Therefore, we propose a simple featureless traditional 3D registration baseline method based on the weighted cross-correlation between two given point clouds. Our method achieves strong results on current benchmarking datasets, outperforming most deep learning methods. Our source code is available on github.com/DavidBoja/exhaustive-grid-search.
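To make the featureless baseline concrete, the sketch below scores candidate rotations by voxelizing both point clouds and cross-correlating the occupancy grids over translations via FFT; the voxel size, grid extent, rotation set, and the plain (unweighted) occupancy are simplifying assumptions, not the authors' exhaustive-grid-search code.

```python
import numpy as np

def voxelize(points, voxel=0.05, extent=2.0):
    """Binary occupancy grid of the points lying inside [-extent, extent]^3."""
    dim = int(2 * extent / voxel)
    idx = np.floor((points + extent) / voxel).astype(int)
    keep = np.all((idx >= 0) & (idx < dim), axis=1)
    grid = np.zeros((dim, dim, dim), dtype=np.float32)
    grid[tuple(idx[keep].T)] = 1.0
    return grid

def correlation_score(source, target, rotation):
    """Best cross-correlation over all integer-voxel translations for one rotation."""
    src = voxelize(source @ rotation.T)
    tgt = voxelize(target)
    # Circular cross-correlation via FFT: corr[t] = sum_x src[x] * tgt[x + t]
    corr = np.fft.ifftn(np.conj(np.fft.fftn(src)) * np.fft.fftn(tgt)).real
    return corr.max()

# Usage: exhaustively try a small set of candidate yaw rotations and keep the best one.
rng = np.random.default_rng(0)
target = rng.uniform(-1, 1, size=(500, 3))
angles = np.linspace(0, 2 * np.pi, 16, endpoint=False)
rotations = [np.array([[np.cos(a), -np.sin(a), 0],
                       [np.sin(a),  np.cos(a), 0],
                       [0,          0,         1]]) for a in angles]
source = target @ rotations[3]              # target seen under a known yaw rotation
best = max(rotations, key=lambda R: correlation_score(source, target, R))
```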
{"title":"Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark","authors":"David Bojanić, Kristijan Bartol, Josep Forest, Tomislav Petković, Tomislav Pribanić","doi":"10.1007/s00138-024-01510-w","DOIUrl":"https://doi.org/10.1007/s00138-024-01510-w","url":null,"abstract":"<p>Recent 3D registration methods are mostly learning-based that either find correspondences in feature space and match them, or directly estimate the registration transformation from the given point cloud features. Therefore, these feature-based methods have difficulties with generalizing onto point clouds that differ substantially from their training data. This issue is not so apparent because of the problematic benchmark definitions that cannot provide any in-depth analysis and contain a bias toward similar data. Therefore, we propose a methodology to create a 3D registration benchmark, given a point cloud dataset, that provides a more informative evaluation of a method w.r.t. other benchmarks. Using this methodology, we create a novel FAUST-partial (FP) benchmark, based on the FAUST dataset, with several difficulty levels. The FP benchmark addresses the limitations of the current benchmarks: lack of data and parameter range variability, and allows to evaluate the strengths and weaknesses of a 3D registration method w.r.t. a single registration parameter. Using the new FP benchmark, we provide a thorough analysis of the current state-of-the-art methods and observe that the current method still struggle to generalize onto severely different out-of-sample data. Therefore, we propose a simple featureless traditional 3D registration baseline method based on the weighted cross-correlation between two given point clouds. Our method achieves strong results on current benchmarking datasets, outperforming most deep learning methods. Our source code is available on github.com/DavidBoja/exhaustive-grid-search.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"2015 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140197772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection
Pub Date : 2024-03-23 DOI: 10.1007/s00138-024-01509-3
Bingli Zhang, Yixin Wang, Chengbiao Zhang, Junzhao Jiang, Zehao Pan, Jin Cheng, Yangyang Zhang, Xinyu Wang, Chenglei Yang, Yanhui Wang
Lidar and camera are essential sensors for environment perception in autonomous driving. However, fully fusing heterogeneous data from multiple sources remains a non-trivial challenge. As a result, 3D object detection based on multi-modal sensor fusion is often inferior to single-modal methods based only on Lidar, which indicates that multi-sensor machine vision still needs development. In this paper, we propose an adaptive fusion module based on a cross-modal transformer block (AFMCT) for 3D object detection, utilizing a bidirectional enhancing strategy. Specifically, we first enhance the image features by extracting attention-based point features with a cross-modal transformer block and linking them in a concatenation fashion; a second cross-modal transformer block then acts on the enhanced image features to strengthen the point features with image semantic information. Extensive experiments on the 3D detection benchmark of the KITTI dataset reveal that our proposed structure can significantly improve the detection accuracy of Lidar-only methods and outperform existing advanced multi-sensor fusion modules by at least 0.45%, which indicates that our method might be a feasible solution for improving 3D object detection based on multi-sensor fusion.
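A hedged sketch of one direction of such cross-modal attention is shown below: point features act as queries over image features so that each point gathers image semantics. The dimensions, the single-block layout, and the use of `nn.MultiheadAttention` are assumptions for illustration, not the AFMCT architecture.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One cross-attention direction: queries from one modality, keys/values from the other."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))

    def forward(self, queries, context):               # (B, Nq, D), (B, Nc, D)
        fused, _ = self.attn(queries, context, context)
        x = self.norm1(queries + fused)                 # residual + norm
        return self.norm2(x + self.ffn(x))

point_feats = torch.rand(2, 1024, 128)                  # Lidar point features
image_feats = torch.rand(2, 60 * 80, 128)               # flattened image feature map
block = CrossModalBlock()
points_with_semantics = block(point_feats, image_feats) # (2, 1024, 128)
```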
{"title":"AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection","authors":"Bingli Zhang, Yixin Wang, Chengbiao Zhang, Junzhao Jiang, Zehao Pan, Jin Cheng, Yangyang Zhang, Xinyu Wang, Chenglei Yang, Yanhui Wang","doi":"10.1007/s00138-024-01509-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01509-3","url":null,"abstract":"<p>Lidar and camera are essential sensors for environment perception in autonomous driving. However, fully fusing heterogeneous data from multiple sources remains a non-trivial challenge. As a result, 3D object detection based on multi-modal sensor fusion are often inferior to single-modal methods only based on Lidar, which indicates that multi-sensor machine vision still needs development. In this paper, we propose an adaptive fusion module based on cross-modal transformer block(AFMCT) for 3D object detection by utilizing a bidirectional enhancing strategy. Specifically, we first enhance image feature by extracting an attention-based point feature based on a cross-modal transformer block and linking them in a concatenation fashion, followed by another cross-modal transformer block acting on the enhanced image feature to strengthen the point feature with image semantic information. Extensive experiments operated on the 3D detection benchmark of the KITTI dataset reveal that our proposed structure can significantly improve the detection accuracy of Lidar-only methods and outperform the existing advanced multi-sensor fusion modules by at least 0.45%, which indicates that our method might be a feasible solution to improving 3D object detection based on multi-sensor fusion.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"30 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140197472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyperspectral image dynamic range reconstruction using deep neural network-based denoising methods
Pub Date : 2024-03-22 DOI: 10.1007/s00138-024-01523-5
Loran Cheplanov, Shai Avidan, David J. Bonfil, Iftach Klapp
Hyperspectral (HS) measurement is among the most useful tools in agriculture for early disease detection. However, the cost of HS cameras that can perform the desired detection tasks is prohibitive, typically fifty thousand to hundreds of thousands of dollars. In a previous study at the Agricultural Research Organization's Volcani Institute (Israel), a low-cost, high-performing HS system was developed which included a point spectrometer and optical components. Its main disadvantage was the long shooting time for each image. Shooting time strongly depends on the predetermined integration time of the point spectrometer. While essential for performing monitoring tasks in a reasonable time, shortening the integration time from a typical value of around 200 ms to the 10 ms range deteriorates the dynamic range of the captured scene. In this work, we suggest correcting this by learning the transformation from data measured with a short integration time to data measured with a long integration time. The reduction of the dynamic range and the consequent low SNR were successfully overcome using three deep neural network models developed with denoising auto-encoder, DnCNN, and LambdaNetworks architectures as backbones. The best model was based on DnCNN using a combined loss function of \(\ell _{2}\) and Kullback–Leibler divergence on images with 20 consecutive channels. Over the full spectrum, the model achieved a mean PSNR of 30.61 and a mean SSIM of 0.9, improving on the 10 ms measurements' mean PSNR and mean SSIM values by 60.43% and 94.51%, respectively.
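The combined \(\ell _{2}\) + Kullback–Leibler loss can be sketched as below. How the two distributions are formed (here, each pixel's channel vector normalized into a probability distribution) and the weighting between the two terms are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def combined_l2_kl_loss(pred, target, kl_weight=0.1, eps=1e-8):
    """pred, target: (B, C, H, W) with C spectral channels.
    The L2 term compares raw intensities; the KL term compares per-pixel spectral
    distributions obtained by normalizing each pixel's channel vector."""
    l2 = F.mse_loss(pred, target)

    # Normalize the C channels of every pixel into a probability distribution.
    p = target.clamp_min(eps)
    p = p / p.sum(dim=1, keepdim=True)
    q = pred.clamp_min(eps)
    q = q / q.sum(dim=1, keepdim=True)
    kl = (p * (p / q).log()).sum(dim=1).mean()       # KL(target || prediction)

    return l2 + kl_weight * kl

pred = torch.rand(2, 20, 64, 64, requires_grad=True)  # 20 consecutive channels
target = torch.rand(2, 20, 64, 64)
loss = combined_l2_kl_loss(pred, target)
loss.backward()
```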
{"title":"Hyperspectral image dynamic range reconstruction using deep neural network-based denoising methods","authors":"Loran Cheplanov, Shai Avidan, David J. Bonfil, Iftach Klapp","doi":"10.1007/s00138-024-01523-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01523-5","url":null,"abstract":"<p>Hyperspectral (HS) measurement is among the most useful tools in agriculture for early disease detection. However, the cost of HS cameras that can perform the desired detection tasks is prohibitive-typically fifty thousand to hundreds of thousands of dollars. In a previous study at the Agricultural Research Organization’s Volcani Institute (Israel), a low-cost, high-performing HS system was developed which included a point spectrometer and optical components. Its main disadvantage was long shooting time for each image. Shooting time strongly depends on the predetermined integration time of the point spectrometer. While essential for performing monitoring tasks in a reasonable time, shortening integration time from a typical value in the range of 200 ms to the 10 ms range results in deterioration of the dynamic range of the captured scene. In this work, we suggest correcting this by learning the transformation from data measured with short integration time to that measured with long integration time. Reduction of the dynamic range and consequent low SNR were successfully overcome using three developed deep neural networks models based on a denoising auto-encoder, DnCNN and LambdaNetworks architectures as a backbone. The best model was based on DnCNN using a combined loss function of <span>(ell _{2})</span> and Kullback–Leibler divergence on images with 20 consecutive channels. The full spectrum of the model achieved a mean PSNR of 30.61 and mean SSIM of 0.9, showing total improvement relatively to the 10 ms measurements’ mean PSNR and mean SSIM values by 60.43% and 94.51%, respectively.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"25 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140197865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Point cloud registration with quantile assignment
Pub Date : 2024-03-19 DOI: 10.1007/s00138-024-01517-3
Ecenur Oğuz, Yalım Doğan, Uğur Güdükbay, Oya Karaşan, Mustafa Pınar
Point cloud registration is a fundamental problem in computer vision. The problem encompasses critical tasks such as feature estimation, correspondence matching, and transformation estimation. The point cloud registration problem can be cast as a quantile matching problem. We refined the quantile assignment algorithm by integrating prevalent feature descriptors and transformation estimation methods to enhance the correspondence between the source and target point clouds. We evaluated the performance of these descriptors and methods within our approach through controlled experiments on a dataset we constructed using well-known 3D models. This systematic investigation led us to identify the most suitable methods for complementing our approach. Subsequently, we devised a new end-to-end, coarse-to-fine pairwise point cloud registration framework. Finally, we tested our framework on indoor and outdoor benchmark datasets and compared our results with state-of-the-art point cloud registration methods.
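One simplistic, hedged reading of quantile-based matching is sketched below: tentative correspondences are kept only if their descriptor distance falls under a chosen quantile, and a rigid transform is then fitted to the survivors with the Kabsch/Procrustes solution. The quantile level, descriptor choice, and single-pass fit are illustrative assumptions, not the paper's quantile assignment algorithm.

```python
import numpy as np

def quantile_filtered_transform(src, tgt, src_desc, tgt_desc, q=0.25):
    """src, tgt: (N, 3) points; src_desc, tgt_desc: (N, D) descriptors.
    Keep nearest-neighbour matches below the q-quantile distance, then fit R, t."""
    d = np.linalg.norm(src_desc[:, None, :] - tgt_desc[None, :, :], axis=2)  # (N, N)
    nn = d.argmin(axis=1)
    nn_dist = d[np.arange(len(src)), nn]
    keep = nn_dist <= np.quantile(nn_dist, q)            # quantile-based selection

    a, b = src[keep], tgt[nn[keep]]
    a_c, b_c = a - a.mean(0), b - b.mean(0)
    u, _, vt = np.linalg.svd(a_c.T @ b_c)                 # Kabsch / Procrustes
    r = vt.T @ np.diag([1, 1, np.sign(np.linalg.det(vt.T @ u.T))]) @ u.T
    t = b.mean(0) - r @ a.mean(0)
    return r, t

rng = np.random.default_rng(1)
src = rng.normal(size=(200, 3))
theta = 0.3
true_r = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
true_t = np.array([0.5, -0.2, 0.1])
tgt = src @ true_r.T + true_t
desc = rng.normal(size=(200, 8))                          # toy descriptors, shared here
r, t = quantile_filtered_transform(src, tgt, desc, desc)  # recovers true_r, true_t
```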
{"title":"Point cloud registration with quantile assignment","authors":"Ecenur Oğuz, Yalım Doğan, Uğur Güdükbay, Oya Karaşan, Mustafa Pınar","doi":"10.1007/s00138-024-01517-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01517-3","url":null,"abstract":"<p>Point cloud registration is a fundamental problem in computer vision. The problem encompasses critical tasks such as feature estimation, correspondence matching, and transformation estimation. The point cloud registration problem can be cast as a quantile matching problem. We refined the quantile assignment algorithm by integrating prevalent feature descriptors and transformation estimation methods to enhance the correspondence between the source and target point clouds. We evaluated the performances of these descriptors and methods with our approach through controlled experiments on a dataset we constructed using well-known 3D models. This systematic investigation led us to identify the most suitable methods for complementing our approach. Subsequently, we devised a new end-to-end, coarse-to-fine pairwise point cloud registration framework. Finally, we tested our framework on indoor and outdoor benchmark datasets and compared our results with state-of-the-art point cloud registration methods.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"40 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140197471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An image quality assessment method based on edge extraction and singular value for blurriness
Pub Date : 2024-03-19 DOI: 10.1007/s00138-024-01522-6
Lei Zhou, Chuanlin Liu, Amit Yadav, Sami Azam, Asif Karim
The automatic assessment of perceived image quality is crucial in the field of image processing. To this end, we propose an image quality assessment (IQA) method for blurriness. Gradient and singular value features are extracted in this method, instead of the single feature used in traditional IQA algorithms. Because existing public image quality assessment datasets are too small to support deep learning, machine learning was introduced to fuse features from multiple domains, and a new no-reference (NR) IQA method for blurriness, denoted Feature fusion IQA (Ffu-IQA), was proposed. Ffu-IQA uses a probabilistic model to estimate the probability that each detected edge in the image is blurred, and then uses machine learning to aggregate the probability information to obtain an edge quality score. It then uses the singular values obtained by singular value decomposition of the image matrix to calculate a singular value score. Finally, machine learning pooling is used to obtain the overall quality score. Ffu-IQA achieves PLCC scores of 0.9570 and 0.9616 on CSIQ and TID2013, respectively, and SROCC scores of 0.9380 and 0.9531, which are better than those of most traditional image quality assessment methods for blurriness.
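The singular value part of such a score can be illustrated as follows: blur suppresses high-frequency detail, which concentrates the image's singular value energy into the leading values, so the fraction of energy outside the top-k singular values serves as a crude sharpness proxy. The grayscale input, the choice of k, and the energy ratio itself are illustrative assumptions, not the Ffu-IQA formulation.

```python
import numpy as np

def singular_value_score(image, k=10):
    """image: (H, W) grayscale array. Returns the fraction of singular value
    energy carried by all but the top-k values; blurrier images score lower."""
    s = np.linalg.svd(image.astype(np.float64), compute_uv=False)
    return float(s[k:].sum() / s.sum())

rng = np.random.default_rng(0)
sharp = rng.uniform(0, 255, size=(128, 128))

# Crude box blur: average each pixel with its 8 neighbours via shifted copies.
blurred = sum(np.roll(np.roll(sharp, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0

print(singular_value_score(sharp), singular_value_score(blurred))  # sharp scores higher
```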
{"title":"An image quality assessment method based on edge extraction and singular value for blurriness","authors":"Lei Zhou, Chuanlin Liu, Amit Yadav, Sami Azam, Asif Karim","doi":"10.1007/s00138-024-01522-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01522-6","url":null,"abstract":"<p>The automatic assessment of perceived image quality is crucial in the field of image processing. To achieve this idea, we propose an image quality assessment (IQA) method for blurriness. The features of gradient and singular value were extracted in this method instead of the single feature in the traditional IQA algorithms. According to the insufficient size of existing public image quality assessment datasets to support deep learning, machine learning was introduced to fuse the features of multiple domains, and a new no-reference (NR) IQA method for blurriness denoted Feature fusion IQA(Ffu-IQA) was proposed. The Ffu-IQA uses a probabilistic model to estimate the probability of each edge detection blur in the image, and then uses machine learning to aggregate the probability information to obtain the edge quality score. After that uses the singular value obtained by singular value decomposition of the image matrix to calculate the singular value score. Finally, machine learning pooling is used to obtain the true quality score. Ffu-IQA achieves PLCC scores of 0.9570 and 0.9616 on CSIQ and TID2013, respectively, and SROCC scores of 0.9380 and 0.9531, which are better than most traditional image quality assessment methods for blurriness.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"6 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140171870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal teacher with masked transformers for semi-supervised action proposal generation
Pub Date : 2024-03-15 DOI: 10.1007/s00138-024-01521-7
Selen Pehlivan, Jorma Laaksonen
By conditioning on unit-level predictions, anchor-free models for action proposal generation have displayed impressive capabilities, such as having a lightweight architecture. However, task performance depends significantly on the quality of the data used in training, and the most effective models have relied on human-annotated data. Semi-supervised learning, i.e., jointly training deep neural networks on a labeled dataset as well as an unlabeled dataset, has made significant progress recently. Existing works have either primarily focused on classification tasks, which may require less annotation effort, or considered anchor-based detection models. Inspired by recent advances in semi-supervised methods for anchor-free object detectors, we propose a teacher-student framework for a two-stage action detection pipeline, named Temporal Teacher with Masked Transformers (TTMT), to generate high-quality action proposals based on an anchor-free transformer model. Leveraging consistency learning as a self-training technique, the model jointly trains an anchor-free student model and a gradually progressing teacher counterpart in a mutually beneficial manner. As the core model, we design a Transformer-based anchor-free model to improve effectiveness for temporal evaluation. We integrate bi-directional masks and devise encoder-only Masked Transformers for sequences. Jointly training on boundary locations and various local snippet-based features, our model predicts proposal candidates via the proposed scoring function. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our model for the temporal proposal generation task.
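The teacher-student mechanics behind such a framework are commonly implemented with an exponential-moving-average (EMA) teacher and a consistency loss on unlabeled clips, as in the hedged sketch below; the EMA rate, the plain MSE consistency term, and the `ProposalModel` placeholder are assumptions rather than details of TTMT.

```python
import copy
import torch
import torch.nn as nn

class ProposalModel(nn.Module):                  # placeholder boundary-score head
    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))
    def forward(self, feats):                    # (B, T, D) snippet features
        return self.head(feats)                  # (B, T, 2) start/end scores

student = ProposalModel()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                      # teacher is only updated via EMA

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def ema_update(teacher, student, momentum=0.999):
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

labeled, labels = torch.rand(4, 100, 256), torch.rand(4, 100, 2)
unlabeled = torch.rand(4, 100, 256)

sup_loss = nn.functional.mse_loss(student(labeled), labels)        # supervised term
with torch.no_grad():
    pseudo = teacher(unlabeled)                                     # teacher targets
cons_loss = nn.functional.mse_loss(student(unlabeled), pseudo)      # consistency term

(sup_loss + cons_loss).backward()
optimizer.step()
ema_update(teacher, student)                     # teacher slowly follows the student
```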
{"title":"Temporal teacher with masked transformers for semi-supervised action proposal generation","authors":"Selen Pehlivan, Jorma Laaksonen","doi":"10.1007/s00138-024-01521-7","DOIUrl":"https://doi.org/10.1007/s00138-024-01521-7","url":null,"abstract":"<p>By conditioning on unit-level predictions, anchor-free models for action proposal generation have displayed impressive capabilities, such as having a lightweight architecture. However, task performance depends significantly on the quality of data used in training, and most effective models have relied on human-annotated data. Semi-supervised learning, i.e., jointly training deep neural networks with a labeled dataset as well as an unlabeled dataset, has made significant progress recently. Existing works have either primarily focused on classification tasks, which may require less annotation effort, or considered anchor-based detection models. Inspired by recent advances in semi-supervised methods on anchor-free object detectors, we propose a teacher-student framework for a two-stage action detection pipeline, named Temporal Teacher with Masked Transformers (TTMT), to generate high-quality action proposals based on an anchor-free transformer model. Leveraging consistency learning as one self-training technique, the model jointly trains an anchor-free student model and a gradually progressing teacher counterpart in a mutually beneficial manner. As the core model, we design a Transformer-based anchor-free model to improve effectiveness for temporal evaluation. We integrate bi-directional masks and devise encoder-only Masked Transformers for sequences. Jointly training on boundary locations and various local snippet-based features, our model predicts via the proposed scoring function for generating proposal candidates. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our model for temporal proposal generation task.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"67 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140150877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}