A review of adaptable conventional image processing pipelines and deep learning on limited datasets
Pub Date: 2024-01-31 | DOI: 10.1007/s00138-023-01501-3
Friedrich Rieken Münke, Jan Schützke, Felix Berens, Markus Reischl
The objective of this paper is to study the impact of limited datasets on deep learning techniques and conventional methods in semantic image segmentation, and to conduct a comparative analysis that determines when each approach is preferable. We introduce a synthetic data generator, which enables us to evaluate the effect of the number of training samples as well as the difficulty and diversity of the dataset. We show that deep learning methods excel when large datasets are available and that conventional image processing approaches perform well when datasets are small and diverse. Since transfer learning is a common way to work around small datasets, we specifically assess it and find its benefit to be only marginal. Furthermore, we implement the conventional image processing pipeline so that it can be applied quickly and easily to new problems, allowing conventional methods to be tested alongside deep learning with minimal overhead.
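For readers unfamiliar with what such a conventional pipeline looks like in practice, here is a minimal sketch (not the authors' actual implementation): Gaussian smoothing, Otsu thresholding, and morphological cleanup with scikit-image, the kind of hand-tuned chain that papers of this type compare against learned models.

```python
# Illustrative conventional segmentation pipeline; all parameter choices
# (sigma, disk radius, min_size) are assumptions, not the paper's values.
import numpy as np
from skimage import filters, morphology, measure

def segment_conventional(image: np.ndarray, min_size: int = 64) -> np.ndarray:
    """Segment a grayscale image into labeled foreground regions."""
    smoothed = filters.gaussian(image, sigma=2.0)           # suppress noise
    mask = smoothed > filters.threshold_otsu(smoothed)      # global threshold
    mask = morphology.binary_opening(mask, morphology.disk(3))
    mask = morphology.remove_small_objects(mask, min_size=min_size)
    return measure.label(mask)                              # instance labels
```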
{"title":"A review of adaptable conventional image processing pipelines and deep learning on limited datasets","authors":"Friedrich Rieken Münke, Jan Schützke, Felix Berens, Markus Reischl","doi":"10.1007/s00138-023-01501-3","DOIUrl":"https://doi.org/10.1007/s00138-023-01501-3","url":null,"abstract":"<p>The objective of this paper is to study the impact of limited datasets on deep learning techniques and conventional methods in semantic image segmentation and to conduct a comparative analysis in order to determine the optimal scenario for utilizing both approaches. We introduce a synthetic data generator, which enables us to evaluate the impact of the number of training samples as well as the difficulty and diversity of the dataset. We show that deep learning methods excel when large datasets are available and conventional image processing approaches perform well when the datasets are small and diverse. Since transfer learning is a common approach to work around small datasets, we are specifically assessing its impact and found only marginal impact. Furthermore, we implement the conventional image processing pipeline to enable fast and easy application to new problems, making it easy to apply and test conventional methods alongside deep learning with minimal overhead.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"34 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139648922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Regional filtering distillation for object detection
Pub Date: 2024-01-31 | DOI: 10.1007/s00138-023-01503-1
Abstract
Knowledge distillation is a common and effective method in model compression: a compact student model is trained to mimic the capability of a large teacher model in order to obtain superior generalization. Compared to its typical application to straightforward classification tasks, previous work on knowledge distillation underperforms on challenging tasks such as object detection. In this paper, we argue that this failure is mainly caused by the imbalance between informative features and invalid background. Not all background noise is redundant: the valuable background retained after screening contains relations between foreground and background. Therefore, we propose a novel regional filtering distillation (RFD) algorithm that addresses this problem through two modules: region selection and attention-guided distillation. Region selection first filters out massive invalid background and retains knowledge-dense regions near object anchor locations. Attention-guided distillation further improves distillation performance on object detection by extracting the relations between foreground and background to migrate key features. Extensive experiments on both one-stage and two-stage detectors demonstrate the effectiveness of RFD. For example, RFD improves mAP by 2.8% and 2.6% for ResNet50-RetinaNet and ResNet50-FPN student networks on the MS COCO dataset, respectively. We also evaluate our method with the Faster R-CNN model on the Pascal VOC and KITTI benchmarks, obtaining mAP improvements of 1.52% and 4.36% for the ResNet18-FPN student network, respectively. Furthermore, our method increases mAP by 5.70% for MobileNetv2-SSD compared to the original model. The proposed RFD technique thus performs strongly on detection tasks through regional filtering distillation. In the future, we plan to extend it to more challenging task scenarios, such as segmentation.
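As a rough illustration of region-masked, attention-guided feature distillation, the sketch below weights a student-teacher feature L2 loss by a binary near-anchor region mask and a teacher-derived spatial attention map. The exact RFD mask construction and loss are the paper's own; everything here is an illustrative stand-in.

```python
import torch

def rfd_like_loss(f_student: torch.Tensor,
                  f_teacher: torch.Tensor,
                  region_mask: torch.Tensor) -> torch.Tensor:
    """
    f_student, f_teacher: (B, C, H, W) feature maps, channels already matched.
    region_mask: (B, 1, H, W) binary mask keeping knowledge-dense regions
                 near object anchors (the 'region selection' step).
    """
    # Spatial attention from the teacher: mean absolute activation per
    # location, normalized so informative locations carry more weight.
    attn = f_teacher.abs().mean(dim=1, keepdim=True)             # (B,1,H,W)
    attn = attn / (attn.sum(dim=(2, 3), keepdim=True) + 1e-6)
    weight = region_mask * attn
    diff = (f_student - f_teacher).pow(2)
    return (weight * diff).sum() / region_mask.sum().clamp(min=1.0)
```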
{"title":"Regional filtering distillation for object detection","authors":"","doi":"10.1007/s00138-023-01503-1","DOIUrl":"https://doi.org/10.1007/s00138-023-01503-1","url":null,"abstract":"<h3>Abstract</h3> <p>Knowledge distillation is a common and effective method in model compression, which trains a compact student model to mimic the capability of a large teacher model to get superior generalization. Previous works on knowledge distillation are underperforming for challenging tasks such as object detection, compared to the general application of unsophisticated classification tasks. In this paper, we propose that the failure of knowledge distillation on object detection is mainly caused by the imbalance between features of informative and invalid background. Not all background noise is redundant, and the valuable background noise after screening contains relations between foreground and background. Therefore, we propose a novel regional filtering distillation (RFD) algorithm to solve this problem through two modules: region selection and attention-guided distillation. Region selection first filters massive invalid backgrounds and retains knowledge-dense regions on near object anchor locations. Attention-guided distillation further improves distillation performance on object detection tasks by extracting the relations between foreground and background to migrate key features. Extensive experiments on both one-stage and two-stage detectors have been conducted to prove the effectiveness of RFD. For example, RFD improves 2.8% and 2.6% mAP for ResNet50-RetinaNet and ResNet50-FPN student networks on the MS COCO dataset, respectively. We also evaluate our method with the Faster R-CNN model on Pascal VOC and KITTI benchmark, which obtain 1.52% and 4.36% mAP promotions for the ResNet18-FPN student network, respectively. Furthermore, our method increases 5.70% of mAP for MobileNetv2-SSD compared to the original model. The proposed RFD technique performs highly on detection tasks through regional filtering distillation. In the future, we plan to extend it to more challenging task scenarios, such as segmentation.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"12 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139648921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
STARNet: spatio-temporal aware recurrent network for efficient video object detection on embedded devices
Pub Date: 2024-01-31 | DOI: 10.1007/s00138-023-01504-0
Mohammad Hajizadeh, Mohammad Sabokrou, Adel Rahmani
The challenge of transferring object detection methods from images to video remains unsolved. When applied to video, image-based methods frequently fail to generalize due to blurriness, unusual or unclear object positions, low quality, and related problems. In addition, the lack of a good long-term memory in video object detection presents a further challenge. The outputs of successive frames are usually quite similar, and we rely on this fact; moreover, a series of successive or non-successive frames contains more information than any single frame. In this study, we present a novel recurrent cell for feature propagation and identify the optimal placement of layers to increase the memory interval. As a result, we achieve higher accuracy than methods proposed in other studies. Hardware limitations can exacerbate this challenge, so the paper also aims to implement the method efficiently on embedded devices. We achieve 68.7% mAP on the ImageNet VID dataset in real time on embedded devices, at a speed of 52 fps.
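The recurrent cell proposed in the paper is novel; as a generic stand-in, the sketch below shows how a ConvGRU-style cell propagates features from one frame to the next, which is the mechanism the abstract describes. Layer sizes and the gating convention are assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """ConvGRU-style cell: fuses the current frame's features with a hidden
    state that carries memory across frames. Illustrative, not STARNet's cell."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=pad)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)               # update / reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde        # fused memory for next frame
```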
{"title":"STARNet: spatio-temporal aware recurrent network for efficient video object detection on embedded devices","authors":"Mohammad Hajizadeh, Mohammad Sabokrou, Adel Rahmani","doi":"10.1007/s00138-023-01504-0","DOIUrl":"https://doi.org/10.1007/s00138-023-01504-0","url":null,"abstract":"<p>The challenge of converting various object detection methods from image to video remains unsolved. When applied to video, image methods frequently fail to generalize effectively due to issues, such as blurriness, different and unclear positions, low quality, and other relevant issues. Additionally, the lack of a good long-term memory in video object detection presents an additional challenge. In the majority of instances, the outputs of successive frames are known to be quite similar; therefore, this fact is relied upon. Furthermore, the information contained in a series of successive or non-successive frames is greater than that contained in a single frame. In this study, we present a novel recurrent cell for feature propagation and identify the optimal location of layers to increase the memory interval. As a result, we achieved higher accuracy compared to other proposed methods in other studies. Hardware limitations can exacerbate this challenge. The paper aims to implement and increase the efficiency of the methods on embedded devices. We achieved 68.7% <i>mAP</i> accuracy on the ImageNet VID dataset for embedded devices in real-time and at a speed of 52 <i>fps</i>.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"32 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139648920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SGBGAN: minority class image generation for class-imbalanced datasets
Pub Date: 2024-01-29 | DOI: 10.1007/s00138-023-01506-y
Qian Wan, Wenhui Guo, Yanjiang Wang
Abstract
Class imbalance frequently arises in the context of image classification. Conventional generative adversarial networks (GANs) tend to produce samples from the majority class when trained on class-imbalanced datasets. To address this issue, the Balancing GAN with gradient penalty (BAGAN-GP) has been proposed, but its outputs may still exhibit a bias toward the majority categories when images from different categories are highly similar. In this study, we introduce a novel approach called the Pre-trained Gated Variational Autoencoder with Self-attention for Balancing Generative Adversarial Network (SGBGAN) as an image-augmentation technique for generating high-quality images. The proposed method uses a Gated Variational Autoencoder with Self-attention (SA-GVAE) to initialize the GAN and transfers the pre-trained SA-GVAE weights to the GAN. Our experimental results on Fashion-MNIST, CIFAR-10, and a highly unbalanced medical image dataset demonstrate that SGBGAN outperforms other state-of-the-art methods. Results on Fréchet inception distance (FID) and structural similarity (SSIM) measures show that our model overcomes the instability problems that exist in other GANs. In particular, on the Cells dataset the FID of a minority class improves by up to 23.09% compared to the latest BAGAN-GP, and the SSIM of a minority class increases by up to 10.81%. These results show that SGBGAN overcomes the class-imbalance restriction and generates high-quality minority-class images.
Graphical abstract
The diagram provides an overview of the technical approach employed in this research paper. To address the issue of class imbalance within the dataset, a novel technique called the Gated Variational Autoencoder with Self-attention (SA-GVAE) is proposed. This SA-GVAE is utilized to initialize the Generative Adversarial Network (GAN), with the pre-trained weights from SA-GVAE being transferred to the GAN. Consequently, a Pre-trained Gated Variational Autoencoder with Self-attention for Balancing GAN (SGBGAN) is formed, serving as an image augmentation tool to generate high-quality images. Ultimately, the generation of minority samples is employed to restore class balance within the dataset.
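A minimal sketch of the weight-transfer step described above, assuming the GAN generator mirrors the pre-trained SA-GVAE decoder layer-for-layer; the module layout and parameter names are hypothetical.

```python
import torch.nn as nn

def transfer_decoder_to_generator(vae_decoder: nn.Module,
                                  generator: nn.Module) -> list:
    """Copy every parameter/buffer whose name and shape match between the
    pre-trained decoder and the generator (assumed to share a layout)."""
    gen_state = generator.state_dict()
    matched = {k: v for k, v in vae_decoder.state_dict().items()
               if k in gen_state and v.shape == gen_state[k].shape}
    gen_state.update(matched)
    generator.load_state_dict(gen_state)
    return sorted(matched)      # report which layers were initialized
```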
{"title":"SGBGAN: minority class image generation for class-imbalanced datasets","authors":"Qian Wan, Wenhui Guo, Yanjiang Wang","doi":"10.1007/s00138-023-01506-y","DOIUrl":"https://doi.org/10.1007/s00138-023-01506-y","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Abstract</h3><p>Class imbalance frequently arises in the context of image classification. Conventional generative adversarial networks (GANs) have a tendency to produce samples from the majority class when trained on class-imbalanced datasets. To address this issue, the Balancing GAN with gradient penalty (BAGAN-GP) has been proposed, but the outcomes may still exhibit a bias toward the majority categories when the similarity between images from different categories is substantial. In this study, we introduce a novel approach called the Pre-trained Gated Variational Autoencoder with Self-attention for Balancing Generative Adversarial Network (SGBGAN) as an image augmentation technique for generating high-quality images. The proposed method utilizes a Gated Variational Autoencoder with Self-attention (SA-GVAE) to initialize the GAN and transfers pre-trained SA-GVAE weights to the GAN. Our experimental results on Fashion-MNIST, CIFAR-10, and a highly unbalanced medical image dataset demonstrate that the SGBGAN outperforms other state-of-the-art methods. Results on Fréchet inception distance (FID) and structural similarity measures (SSIM) show that our model overcomes the instability problems that exist in other GANs. Especially on the Cells dataset, the FID of a minority class increases up to 23.09% compared to the latest BAGAN-GP, and the SSIM of a minority class increases up to 10.81%. It is proved that SGBGAN overcomes the class imbalance restriction and generates high-quality minority class images.\u0000</p><h3 data-test=\"abstract-sub-heading\">Graphical abstract</h3><p>The diagram provides an overview of the technical approach employed in this research paper. To address the issue of class imbalance within the dataset, a novel technique called the Gated Variational Autoencoder with Self-attention (SA-GVAE) is proposed. This SA-GVAE is utilized to initialize the Generative Adversarial Network (GAN), with the pre-trained weights from SA-GVAE being transferred to the GAN. Consequently, a Pre-trained Gated Variational Autoencoder with Self-attention for Balancing GAN (SGBGAN) is formed, serving as an image augmentation tool to generate high-quality images. Ultimately, the generation of minority samples is employed to restore class balance within the dataset.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"200 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139649209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tackling confusion among actions for action segmentation with adaptive margin and energy-driven refinement
Pub Date: 2024-01-27 | DOI: 10.1007/s00138-023-01505-z
Zhichao Ma, Kan Li
Video action segmentation is a crucial task for evaluating the ability to understand human activities. Previous works on this task focus mainly on capturing complex temporal structures and fail to consider the feature ambiguity among similar actions and the bias of training sets; as a result, they easily confuse certain actions. In this paper, we propose a novel action segmentation framework, called DeConfuNet, to solve this issue. First, we design a discriminative enhancement module (DEM) trained by adaptive margin-guided discriminative feature learning, which adjusts the margin adaptively to increase feature distinguishability among similar actions, and whose multi-stage reasoning and adaptive feature-fusion structures provide structural advantages for distinguishing similar actions. Second, we propose an equalizing influence module (EIM) that overcomes the impact of biased training sets by balancing the influence of training samples under a coefficient-adaptive loss function. Third, an energy- and context-driven refinement module (ECRM) further alleviates the impact of unbalanced training-sample influence by fusing and refining the inferences of the DEM and EIM; it uses phased predictions, including context and energy clues, to assimilate untrustworthy segments, greatly alleviating over-segmentation. Extensive experiments show the effectiveness of each proposed technique and verify that the DEM and EIM are complementary in reasoning and cooperate to overcome the confusion issue; our approach achieves significant improvements and state-of-the-art accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.
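To make the adaptive-margin idea concrete, here is a hedged sketch of a margin-augmented cross-entropy in which the margin grows for classes that are easily confused with others; the paper's actual margin rule and module structure are not reproduced here, so the formula below is illustrative only.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_ce(logits: torch.Tensor,
                       target: torch.Tensor,
                       class_sim: torch.Tensor,
                       base_margin: float = 0.2) -> torch.Tensor:
    """
    logits: (N, K) classifier outputs; target: (N,) class ids.
    class_sim: (K, K) similarity between class prototypes. Classes whose
    prototypes are close to others get a larger margin, pushing their
    features apart. Hypothetical rule, not DeConfuNet's exact one.
    """
    eye = torch.eye(class_sim.size(0), device=class_sim.device)
    sim = class_sim - eye                                    # ignore diagonal
    margin = base_margin * (1.0 + sim.max(dim=1).values)     # (K,)
    adjusted = logits.clone()
    idx = torch.arange(logits.size(0), device=logits.device)
    adjusted[idx, target] -= margin[target]                  # harder target
    return F.cross_entropy(adjusted, target)
```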
{"title":"Tackling confusion among actions for action segmentation with adaptive margin and energy-driven refinement","authors":"Zhichao Ma, Kan Li","doi":"10.1007/s00138-023-01505-z","DOIUrl":"https://doi.org/10.1007/s00138-023-01505-z","url":null,"abstract":"<p>Video action segmentation is a crucial task in evaluating the ability to understand human activities. Previous works on this task mainly focus on capturing complex temporal structures and fail to consider the feature ambiguity among similar actions and the biased training sets, thus they are easy to confuse some actions. In this paper, we propose a novel action segmentation framework, called DeConfuNet, to solve the above issue. First, we design a discriminative enhancement module (DEM) trained by an adaptive margin-guided discriminative feature learning which adjusts the margin adaptively to increase the feature distinguishability among similar actions, and whose multi-stage reasoning and adaptive feature fusion structures provide structural advantages for distinguishing similar actions. Second, we propose an equalizing influence module (EIM) that can overcome the impact of biased training sets by balancing the influence of training samples under a coefficient-adaptive loss function. Third, an energy and context-driven refinement module (ECRM) further alleviates the impact of the unbalanced influence of training samples by fusing and refining the inference of DEM and EIM, which utilizes the phased prediction including context and energy clues to assimilate untrustworthy segments, alleviating over-segmentation hugely. Extensive experiments show the effectiveness of each proposed technique, they verify that the DEM and EIM are complementary in reasoning and cooperate to overcome the confusion issue, and our approach achieves significant improvement and state-of-the-art performance of accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"1 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139582210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimization model based on attention mechanism for few-shot image classification
Pub Date: 2024-01-19 | DOI: 10.1007/s00138-023-01502-2
Ruizhi Liao, Junhai Zhai, Feng Zhang
Deep learning has emerged as the leading approach for pattern recognition, but its reliance on large labeled datasets poses challenges in real-world applications where obtaining annotated samples is difficult. Few-shot learning, inspired by human learning, enables fast adaptation to new concepts from limited examples. Optimization-based meta-learning has gained popularity as a few-shot learning method. However, it struggles to capture long-range dependencies of gradients and converges slowly, making it difficult to extract features from limited samples. To overcome these issues, we propose MLAL, an attention-based optimization model for few-shot learning. The model comprises two parts: an attention-LSTM meta-learner, which optimizes gradients hierarchically using the self-attention mechanism, and a cross-attention base-learner, which uses the cross-attention mechanism to jointly learn the common category features of the support and query sets in a meta-task. Extensive experiments on two benchmark datasets show that MLAL achieves exceptional 1-shot and 5-shot classification accuracy on MiniImagenet and TieredImagenet. The code for our proposed method is available at https://github.com/wflrz123/MLAL.
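A minimal sketch of scaled dot-product cross-attention between support and query features, the mechanism the cross-attention base-learner builds on; shapes and the absence of learned projections are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def cross_attention(query_feats: torch.Tensor,
                    support_feats: torch.Tensor) -> torch.Tensor:
    """
    query_feats: (Nq, D) query-set features; support_feats: (Ns, D)
    support-set features. Each query feature attends over the support set,
    emphasizing structure shared between the two (the 'common category
    features' the abstract mentions).
    """
    d = query_feats.size(-1)
    scores = query_feats @ support_feats.t() / d ** 0.5   # (Nq, Ns)
    weights = F.softmax(scores, dim=-1)
    return weights @ support_feats                        # (Nq, D)
```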
{"title":"Optimization model based on attention mechanism for few-shot image classification","authors":"Ruizhi Liao, Junhai Zhai, Feng Zhang","doi":"10.1007/s00138-023-01502-2","DOIUrl":"https://doi.org/10.1007/s00138-023-01502-2","url":null,"abstract":"<p>Deep learning has emerged as the leading approach for pattern recognition, but its reliance on large labeled datasets poses challenges in real-world applications where obtaining annotated samples is difficult. Few-shot learning, inspired by human learning, enables fast adaptation to new concepts with limited examples. Optimization-based meta-learning has gained popularity as a few-shot learning method. However, it struggles with capturing long-range dependencies of gradients and has slow convergence rates, making it challenging to extract features from limited samples. To overcome these issues, we propose MLAL, an optimization model based on attention for few-shot learning. The model comprises two parts: the attention-LSTM meta-learner, which optimizes gradients hierarchically using the self-attention mechanism, and the cross-attention base-learner, which uses the cross-attention mechanism to cross-learn the common category features of support and query sets in a meta-task. Extensive experiments on two benchmark datasets show that MLAL achieves exceptional 1-shot and 5-shot classification accuracy on MiniImagenet and TiredImagenet. The codes for our proposed method are available at https://github.com/wflrz123/MLAL.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"19 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Obs-tackle: an obstacle detection system to assist navigation of visually impaired using smartphones
Pub Date: 2024-01-19 | DOI: 10.1007/s00138-023-01499-8
U. Vijetha, V. Geetha
As the prevalence of vision impairment continues to rise worldwide, there is an increasing need for affordable and accessible solutions that improve the daily experiences of individuals with vision impairment. The visually impaired (VI) are prone to falls and injuries because they cannot recognize dangers on the path while navigating, so it is crucial that they be made aware of potential hazards in both known and unknown environments. Obstacle detection plays a key role in navigation-assistance solutions for VI users, and experimentation on it has surged since the introduction of autonomous navigation in automobiles, robots, and drones. Previously, auditory, laser, and depth sensors dominated obstacle detection; advances in computer vision and deep learning, however, have enabled it with simpler tools such as smartphone cameras. While previous approaches to obstacle detection using estimated depth data have been effective, they suffer from limitations such as reduced accuracy when adapted for edge devices and an inability to identify objects in the scene. To address these limitations, we propose an indoor and outdoor obstacle detection and identification technique that combines semantic segmentation with depth-estimation data. We hypothesize that this combination enhances obstacle detection and identification compared to using depth data alone. To evaluate the effectiveness of the proposed method, we validated it against ground-truth obstacle data derived from the DIODE and NYU Depth v2 datasets. Our experimental results demonstrate that the proposed method achieves nearly 85% accuracy in detecting nearby obstacles, with low false-positive and false-negative rates. A demonstration of the system deployed as an Android app, 'Obs-tackle', is available at https://youtu.be/PSn-FEc5EQg?si=qPGB13tkYkD1kSOf.
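A hedged sketch of how a segmentation map and an estimated depth map can be fused to flag nearby obstacles; the distance threshold, walking-path region, and class ids below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def nearby_obstacles(seg: np.ndarray, depth: np.ndarray,
                     max_dist_m: float = 1.5,
                     floor_ids: tuple = (0,)) -> list:
    """
    seg:   (H, W) int array of semantic class ids from a segmentation model.
    depth: (H, W) float array of estimated metric depth, same resolution.
    Returns class ids appearing closer than max_dist_m in the lower half of
    the frame, where the user's walking path lies.
    """
    h = seg.shape[0]
    path = np.zeros_like(seg, dtype=bool)
    path[h // 2:, :] = True                      # restrict to walking path
    close = (depth < max_dist_m) & path
    ids, counts = np.unique(seg[close], return_counts=True)
    # ignore the floor class and tiny blobs that are likely depth noise
    return [int(i) for i, c in zip(ids, counts)
            if i not in floor_ids and c > 0.001 * seg.size]
```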
{"title":"Obs-tackle: an obstacle detection system to assist navigation of visually impaired using smartphones","authors":"U. Vijetha, V. Geetha","doi":"10.1007/s00138-023-01499-8","DOIUrl":"https://doi.org/10.1007/s00138-023-01499-8","url":null,"abstract":"<p>As the prevalence of vision impairment continues to rise worldwide, there is an increasing need for affordable and accessible solutions that improve the daily experiences of individuals with vision impairment. The Visually Impaired (VI) are often prone to falls and injuries due to their inability to recognize dangers on the path while navigating. It is therefore crucial that they are aware of potential hazards in both known and unknown environments. Obstacle detection plays a key role in navigation assistance solutions for VI users. There has been a surge in experimentation on obstacle detection since the introduction of autonomous navigation in automobiles, robots, and drones. Previously, auditory, laser, and depth sensors dominated obstacle detection; however, advances in computer vision and deep learning have enabled it using simpler tools like smartphone cameras. While previous approaches to obstacle detection using estimated depth data have been effective, they suffer from limitations such as compromised accuracy when adapted for edge devices and the incapability to identify objects in the scene. To address these limitations, we propose an indoor and outdoor obstacle detection and identification technique that combines semantic segmentation with depth estimation data. We hypothesize that this combination of techniques will enhance obstacle detection and identification compared to using depth data alone. To evaluate the effectiveness of our proposed Obstacle detection method, we validated it against ground truth Obstacle data derived from the DIODE and NYU Depth v2 dataset. Our experimental results demonstrate that the proposed method achieves near 85% accuracy in detecting nearby obstacles with lower false positive and false negative rates. The demonstration of the proposed system deployed as an Android app-‘Obs-tackle’ is available at https://youtu.be/PSn-FEc5EQg?si=qPGB13tkYkD1kSOf.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"70 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-view spectral clustering based on constrained Laplacian rank
Pub Date: 2024-01-12 | DOI: 10.1007/s00138-023-01497-w
Jinmei Song, Baokai Liu, Yao Yu, Kaiwu Zhang, Shiqiang Du
The graph-based approach is a representative clustering method among multi-view clustering algorithms. However, it remains challenging to quickly acquire complementary information in multi-view data and to cluster effectively when the quality of the initially constructed data graph is inadequate. We therefore propose CLRSC, a new graph-based method for multi-view spectral clustering based on constrained Laplacian rank (CLR). Our contributions are: (1) Self-representation learning and CLR are extended to the multi-view setting and connected in a unified framework to learn a common affinity matrix. (2) To achieve overall optimization, we construct a graph-learning method based on constrained Laplacian rank and combine it with spectral clustering. (3) We design an iterative optimization-based procedure and show that the algorithm converges. Finally, extensive experiments are carried out on 5 benchmark datasets. The results on the MSRC-v1 and BBCSport datasets show that the accuracy (ACC) of the method is 10.95% and 4.61% higher, respectively, than that of the best comparison algorithm.
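The constrained-Laplacian-rank idea rests on a standard spectral fact: the Laplacian of an affinity graph with exactly c connected components has rank n - c, so c clusters can be read off directly. The sketch below only verifies that property for a given affinity matrix; CLRSC optimizes the affinity matrix under this rank constraint rather than merely checking it.

```python
import numpy as np

def cluster_count_from_affinity(S: np.ndarray, tol: float = 1e-8) -> int:
    """Count connected components of the graph with affinity matrix S by
    counting (near-)zero eigenvalues of its unnormalized Laplacian."""
    S = (S + S.T) / 2.0                       # symmetrize
    L = np.diag(S.sum(axis=1)) - S            # L = D - S, positive semidefinite
    eigvals = np.linalg.eigvalsh(L)           # ascending eigenvalues
    return int(np.sum(eigvals < tol))         # rank(L) = n - c  =>  c zeros
```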
{"title":"Multi-view spectral clustering based on constrained Laplacian rank","authors":"Jinmei Song, Baokai Liu, Yao Yu, Kaiwu Zhang, Shiqiang Du","doi":"10.1007/s00138-023-01497-w","DOIUrl":"https://doi.org/10.1007/s00138-023-01497-w","url":null,"abstract":"<p>The graph-based approach is a representative clustering method among multi-view clustering algorithms. However, it remains a challenge to quickly acquire complementary information in multi-view data and to execute effective clustering when the quality of the initially constructed data graph is inadequate. Therefore, we propose multi-view spectral clustering based on constrained Laplacian rank method, a new graph-based method (CLRSC). The following are our contributions: (1) Self-representation learning and CLR are extended to multi-view and they are connected into a unified framework to learn a common affinity matrix. (2) To achieve the overall optimization we construct a graph learning method based on constrained Laplacian rank and combine it with spectral clustering. (3) An iterative optimization-based procedure we designed and showed that our algorithm is convergent. Finally, sufficient experiments are carried out on 5 benchmark datasets. The experimental results on MSRC-v1 and BBCSport datasets show that the accuracy (ACC) of the method is 10.95% and 4.61% higher than the optimal comparison algorithm, respectively.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"12 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-accuracy 3D locators tracking in real time using monocular vision
Pub Date: 2024-01-11 | DOI: 10.1007/s00138-023-01498-9
C. Elmo Kulanesan, P. Vacher, L. Charleux, E. Roux
In the field of medical applications, precise localization of medical instruments and bone structures is crucial for computer-assisted surgical interventions. In orthopedic surgery, existing devices typically rely on stereoscopic vision; their purpose is to aid the surgeon in screw fixation of prostheses or in bone removal. This article addresses the challenge of localizing a rigid object covered with randomly arranged planar markers using a single camera. This approach is especially important in medical situations where accurate object alignment relative to a camera is necessary at distances ranging from 80 cm to 120 cm. In addition, limiting the locator's size to a few tens of centimeters ensures that it does not obstruct the work area. The rigid locator consists of a solid onto whose surface a set of planar markers (ArUco) is glued. The markers are randomly distributed over the surface so that at least two are visible whatever the orientation of the locator. Calibrating the locator involves finding the relative positions of the individual planar elements and is based on a bundle-adjustment approach. One of the main known difficulties associated with planar markers is pose ambiguity. To solve this problem, our method formulates an efficient initial solution for the optimization step. After calibration, the positioning uncertainties of the locator are better than two-tenths of a cubic millimeter and one-tenth of a degree, regardless of the locator's orientation in space. To assess the proposed method, the locator is rigidly attached to a stylus about twenty centimeters long. With this approach, the tip of the stylus, seen by a 16.1-megapixel camera at a distance of about 1 m, is localized in real time within a cube of less than 1 mm per side. A surface-registration application is demonstrated by using the stylus on an artificial scapula.
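A minimal monocular sketch of the detection-plus-pose step using OpenCV's ArUco module. The API shown is OpenCV's (4.7+; older versions use cv2.aruco.detectMarkers directly), and pooling all visible markers into one SQPnP solve is an illustrative choice, not the paper's ambiguity-resolution method; `marker_corners_3d` stands in for the calibrated marker geometry from the bundle-adjustment step.

```python
import cv2
import numpy as np

def locate(frame, K, dist, marker_corners_3d):
    """
    frame: camera image; K, dist: intrinsics and distortion coefficients.
    marker_corners_3d: dict mapping marker id -> (4, 3) corner coordinates
    in the locator frame (from calibration). Returns (rvec, tvec) or None.
    """
    aruco = cv2.aruco
    detector = aruco.ArucoDetector(
        aruco.getPredefinedDictionary(aruco.DICT_4X4_50),
        aruco.DetectorParameters())
    corners, ids, _ = detector.detectMarkers(frame)
    if ids is None:
        return None
    obj_pts, img_pts = [], []
    for c, i in zip(corners, ids.flatten()):     # pool every visible marker
        if int(i) in marker_corners_3d:
            obj_pts.append(marker_corners_3d[int(i)])
            img_pts.append(c.reshape(4, 2))
    if not obj_pts:
        return None
    ok, rvec, tvec = cv2.solvePnP(
        np.concatenate(obj_pts).astype(np.float32),
        np.concatenate(img_pts).astype(np.float32),
        K, dist, flags=cv2.SOLVEPNP_SQPNP)
    return (rvec, tvec) if ok else None
```

Pooling corners from several non-coplanar markers is what makes a single-camera pose well conditioned here: a lone planar marker is exactly the configuration where pose ambiguity bites.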
{"title":"High-accuracy 3D locators tracking in real time using monocular vision","authors":"C. Elmo Kulanesan, P. Vacher, L. Charleux, E. Roux","doi":"10.1007/s00138-023-01498-9","DOIUrl":"https://doi.org/10.1007/s00138-023-01498-9","url":null,"abstract":"<p>In the field of medical applications, precise localization of medical instruments and bone structures is crucial to ensure computer-assisted surgical interventions. In orthopedic surgery, existing devices typically rely on stereoscopic vision. Their purpose is to aid the surgeon in screw fixation of prostheses or bone removal. This article addresses the challenge of localizing a rigid object consisting of randomly arranged planar markers using a single camera. This approach is especially vital in medical situations where accurate object alignment relative to a camera is necessary at distances ranging from 80 cm to 120 cm. In addition, the size limitation of a few tens of centimeters ensures that the resulting locator does not obstruct the work area. This rigid locator consists of a solid at the surface of which a set of plane markers (ArUco) are glued. These plane markers are randomly distributed over the surface in order to systematically have a minimum of two visible markers whatever the orientation of the locator. The calibration of the locator involves finding the relative positions between the individual planar elements and is based on a bundle adjustment approach. One of the main and known difficulties associated with planar markers is the problem of pose ambiguity. To solve this problem, our method lies in the formulation of an efficient initial solution for the optimization step. After the calibration step, the reached positioning uncertainties of the locator are better than two-tenth of a cubic millimeter and one-tenth of a degree, regardless of the orientation of the locator in space. To assess the proposed method, the locator is rigidly attached to a stylus of about twenty centimeters length. Thanks to this approach, the tip of this stylus seen by a 16.1 megapixel camera at a distance of about 1 m is localized in real time in a cube lower than 1 mm side. A surface registration application is proposed by using the stylus on an artificial scapula.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"129 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Local region-learning modules for point cloud classification
Pub Date: 2023-12-21 | DOI: 10.1007/s00138-023-01495-y
Kaya Turgut, Helin Dutagaci
Data organization via the formation of local regions is an integral part of deep learning networks that process 3D point clouds hierarchically. At each level, the point cloud is sampled to extract representative points, and these points serve as the centers of local regions. The organization of local regions is of considerable importance, since it determines the location and size of the receptive field at a particular layer of feature aggregation. In this paper, we present two local region-learning modules: a Center Shift Module that infers the appropriate shift for each center point, and a Radius Update Module that alters the radius of each local region. The parameters of the modules are learned by optimizing the loss associated with the task within an end-to-end network. We present alternatives for these modules based on various ways of modeling the interactions between the features and locations of 3D points in the point cloud. We integrated both modules, independently and together, into the PointNet++ and PointCNN object classification architectures, and demonstrated that they contribute to a significant increase in classification accuracy on the ScanObjectNN dataset, which consists of scans of real-world objects. Further experiments on the ShapeNet dataset showed that the modules are also effective on 3D CAD models.
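A hedged sketch of the Center Shift Module idea: a small MLP predicts a bounded 3D offset for each sampled center from its features, moving the local-region center before grouping. Layer sizes and the shift bound are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class CenterShiftModule(nn.Module):
    """Predict a per-center 3D shift from point features, trained end-to-end
    with the task loss. A generic stand-in for the paper's module."""
    def __init__(self, feat_dim: int, hidden: int = 64, max_shift: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 3))
        self.max_shift = max_shift          # keep shifts local to the region

    def forward(self, centers: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # centers: (B, M, 3) sampled points; feats: (B, M, F) their features
        shift = torch.tanh(self.mlp(feats)) * self.max_shift
        return centers + shift              # shifted local-region centers
```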
{"title":"Local region-learning modules for point cloud classification","authors":"Kaya Turgut, Helin Dutagaci","doi":"10.1007/s00138-023-01495-y","DOIUrl":"https://doi.org/10.1007/s00138-023-01495-y","url":null,"abstract":"<p>Data organization via forming local regions is an integral part of deep learning networks that process 3D point clouds in a hierarchical manner. At each level, the point cloud is sampled to extract representative points and these points are used to be centers of local regions. The organization of local regions is of considerable importance since it determines the location and size of the receptive field at a particular layer of feature aggregation. In this paper, we present two local region-learning modules: Center Shift Module to infer the appropriate shift for each center point, and Radius Update Module to alter the radius of each local region. The parameters of the modules are learned through optimizing the loss associated with the particular task within an end-to-end network. We present alternatives for these modules through various ways of modeling the interactions of the features and locations of 3D points in the point cloud. We integrated both modules independently and together to the PointNet++ and PointCNN object classification architectures, and demonstrated that the modules contributed to a significant increase in classification accuracy for the ScanObjectNN data set consisting of scans of real-world objects. Our further experiments on ShapeNet data set showed that the modules are also effective on 3D CAD models.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"307 5 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138823537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}