A review of adaptable conventional image processing pipelines and deep learning on limited datasets
Pub Date: 2024-01-31 | DOI: 10.1007/s00138-023-01501-3
Friedrich Rieken Münke, Jan Schützke, Felix Berens, Markus Reischl
The objective of this paper is to study the impact of limited datasets on deep learning techniques and conventional methods in semantic image segmentation, and to conduct a comparative analysis that determines when each approach is preferable. We introduce a synthetic data generator, which enables us to evaluate the effect of the number of training samples as well as the difficulty and diversity of the dataset. We show that deep learning methods excel when large datasets are available and that conventional image processing approaches perform well when datasets are small and diverse. Since transfer learning is a common way to work around small datasets, we specifically assess it and find its benefit to be only marginal. Furthermore, we implement the conventional image processing pipeline so that it can be applied quickly and easily to new problems, allowing conventional methods to be tested alongside deep learning with minimal overhead.
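For readers unfamiliar with what such a conventional pipeline looks like in practice, here is a minimal sketch (not the authors' actual implementation): Gaussian smoothing, Otsu thresholding, and morphological cleanup with scikit-image, the kind of hand-tuned chain that papers of this type compare against learned models.

```python
# Illustrative conventional segmentation pipeline; all parameter choices
# (sigma, disk radius, min_size) are assumptions, not the paper's values.
import numpy as np
from skimage import filters, morphology, measure

def segment_conventional(image: np.ndarray, min_size: int = 64) -> np.ndarray:
    """Segment a grayscale image into labeled foreground regions."""
    smoothed = filters.gaussian(image, sigma=2.0)           # suppress noise
    mask = smoothed > filters.threshold_otsu(smoothed)      # global threshold
    mask = morphology.binary_opening(mask, morphology.disk(3))
    mask = morphology.remove_small_objects(mask, min_size=min_size)
    return measure.label(mask)                              # instance labels
```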
{"title":"A review of adaptable conventional image processing pipelines and deep learning on limited datasets","authors":"Friedrich Rieken Münke, Jan Schützke, Felix Berens, Markus Reischl","doi":"10.1007/s00138-023-01501-3","DOIUrl":"https://doi.org/10.1007/s00138-023-01501-3","url":null,"abstract":"<p>The objective of this paper is to study the impact of limited datasets on deep learning techniques and conventional methods in semantic image segmentation and to conduct a comparative analysis in order to determine the optimal scenario for utilizing both approaches. We introduce a synthetic data generator, which enables us to evaluate the impact of the number of training samples as well as the difficulty and diversity of the dataset. We show that deep learning methods excel when large datasets are available and conventional image processing approaches perform well when the datasets are small and diverse. Since transfer learning is a common approach to work around small datasets, we are specifically assessing its impact and found only marginal impact. Furthermore, we implement the conventional image processing pipeline to enable fast and easy application to new problems, making it easy to apply and test conventional methods alongside deep learning with minimal overhead.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"34 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139648922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Regional filtering distillation for object detection
Pub Date: 2024-01-31 | DOI: 10.1007/s00138-023-01503-1
Abstract
Knowledge distillation is a common and effective method in model compression: a compact student model is trained to mimic the capability of a large teacher model in order to obtain superior generalization. Compared to its typical application to straightforward classification tasks, previous work on knowledge distillation underperforms on challenging tasks such as object detection. In this paper, we argue that this failure is mainly caused by the imbalance between informative features and invalid background. Not all background noise is redundant: the valuable background retained after screening contains relations between foreground and background. Therefore, we propose a novel regional filtering distillation (RFD) algorithm that addresses this problem through two modules: region selection and attention-guided distillation. Region selection first filters out massive invalid background and retains knowledge-dense regions near object anchor locations. Attention-guided distillation further improves distillation performance on object detection by extracting the relations between foreground and background to migrate key features. Extensive experiments on both one-stage and two-stage detectors demonstrate the effectiveness of RFD. For example, RFD improves mAP by 2.8% and 2.6% for ResNet50-RetinaNet and ResNet50-FPN student networks on the MS COCO dataset, respectively. We also evaluate our method with the Faster R-CNN model on the Pascal VOC and KITTI benchmarks, obtaining mAP improvements of 1.52% and 4.36% for the ResNet18-FPN student network, respectively. Furthermore, our method increases mAP by 5.70% for MobileNetv2-SSD compared to the original model. The proposed RFD technique thus performs strongly on detection tasks through regional filtering distillation. In the future, we plan to extend it to more challenging task scenarios, such as segmentation.
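As a rough illustration of region-masked, attention-guided feature distillation, the sketch below weights a student-teacher feature L2 loss by a binary near-anchor region mask and a teacher-derived spatial attention map. The exact RFD mask construction and loss are the paper's own; everything here is an illustrative stand-in.

```python
import torch

def rfd_like_loss(f_student: torch.Tensor,
                  f_teacher: torch.Tensor,
                  region_mask: torch.Tensor) -> torch.Tensor:
    """
    f_student, f_teacher: (B, C, H, W) feature maps, channels already matched.
    region_mask: (B, 1, H, W) binary mask keeping knowledge-dense regions
                 near object anchors (the 'region selection' step).
    """
    # Spatial attention from the teacher: mean absolute activation per
    # location, normalized so informative locations carry more weight.
    attn = f_teacher.abs().mean(dim=1, keepdim=True)             # (B,1,H,W)
    attn = attn / (attn.sum(dim=(2, 3), keepdim=True) + 1e-6)
    weight = region_mask * attn
    diff = (f_student - f_teacher).pow(2)
    return (weight * diff).sum() / region_mask.sum().clamp(min=1.0)
```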
{"title":"Regional filtering distillation for object detection","authors":"","doi":"10.1007/s00138-023-01503-1","DOIUrl":"https://doi.org/10.1007/s00138-023-01503-1","url":null,"abstract":"<h3>Abstract</h3> <p>Knowledge distillation is a common and effective method in model compression, which trains a compact student model to mimic the capability of a large teacher model to get superior generalization. Previous works on knowledge distillation are underperforming for challenging tasks such as object detection, compared to the general application of unsophisticated classification tasks. In this paper, we propose that the failure of knowledge distillation on object detection is mainly caused by the imbalance between features of informative and invalid background. Not all background noise is redundant, and the valuable background noise after screening contains relations between foreground and background. Therefore, we propose a novel regional filtering distillation (RFD) algorithm to solve this problem through two modules: region selection and attention-guided distillation. Region selection first filters massive invalid backgrounds and retains knowledge-dense regions on near object anchor locations. Attention-guided distillation further improves distillation performance on object detection tasks by extracting the relations between foreground and background to migrate key features. Extensive experiments on both one-stage and two-stage detectors have been conducted to prove the effectiveness of RFD. For example, RFD improves 2.8% and 2.6% mAP for ResNet50-RetinaNet and ResNet50-FPN student networks on the MS COCO dataset, respectively. We also evaluate our method with the Faster R-CNN model on Pascal VOC and KITTI benchmark, which obtain 1.52% and 4.36% mAP promotions for the ResNet18-FPN student network, respectively. Furthermore, our method increases 5.70% of mAP for MobileNetv2-SSD compared to the original model. The proposed RFD technique performs highly on detection tasks through regional filtering distillation. In the future, we plan to extend it to more challenging task scenarios, such as segmentation.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"12 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139648921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
STARNet: spatio-temporal aware recurrent network for efficient video object detection on embedded devices
Pub Date: 2024-01-31 | DOI: 10.1007/s00138-023-01504-0
Mohammad Hajizadeh, Mohammad Sabokrou, Adel Rahmani
The challenge of transferring object detection methods from images to video remains unsolved. When applied to video, image-based methods frequently fail to generalize due to blurriness, unusual or unclear object positions, low quality, and related problems. In addition, the lack of a good long-term memory in video object detection presents a further challenge. The outputs of successive frames are usually quite similar, and we rely on this fact; moreover, a series of successive or non-successive frames contains more information than any single frame. In this study, we present a novel recurrent cell for feature propagation and identify the optimal placement of layers to increase the memory interval. As a result, we achieve higher accuracy than methods proposed in other studies. Hardware limitations can exacerbate this challenge, so the paper also aims to implement the method efficiently on embedded devices. We achieve 68.7% mAP on the ImageNet VID dataset in real time on embedded devices, at a speed of 52 fps.
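The recurrent cell proposed in the paper is novel; as a generic stand-in, the sketch below shows how a ConvGRU-style cell propagates features from one frame to the next, which is the mechanism the abstract describes. Layer sizes and the gating convention are assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """ConvGRU-style cell: fuses the current frame's features with a hidden
    state that carries memory across frames. Illustrative, not STARNet's cell."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=pad)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)               # update / reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde        # fused memory for next frame
```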
{"title":"STARNet: spatio-temporal aware recurrent network for efficient video object detection on embedded devices","authors":"Mohammad Hajizadeh, Mohammad Sabokrou, Adel Rahmani","doi":"10.1007/s00138-023-01504-0","DOIUrl":"https://doi.org/10.1007/s00138-023-01504-0","url":null,"abstract":"<p>The challenge of converting various object detection methods from image to video remains unsolved. When applied to video, image methods frequently fail to generalize effectively due to issues, such as blurriness, different and unclear positions, low quality, and other relevant issues. Additionally, the lack of a good long-term memory in video object detection presents an additional challenge. In the majority of instances, the outputs of successive frames are known to be quite similar; therefore, this fact is relied upon. Furthermore, the information contained in a series of successive or non-successive frames is greater than that contained in a single frame. In this study, we present a novel recurrent cell for feature propagation and identify the optimal location of layers to increase the memory interval. As a result, we achieved higher accuracy compared to other proposed methods in other studies. Hardware limitations can exacerbate this challenge. The paper aims to implement and increase the efficiency of the methods on embedded devices. We achieved 68.7% <i>mAP</i> accuracy on the ImageNet VID dataset for embedded devices in real-time and at a speed of 52 <i>fps</i>.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"32 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139648920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SGBGAN: minority class image generation for class-imbalanced datasets
Pub Date: 2024-01-29 | DOI: 10.1007/s00138-023-01506-y
Qian Wan, Wenhui Guo, Yanjiang Wang
Abstract
Class imbalance frequently arises in the context of image classification. Conventional generative adversarial networks (GANs) tend to produce samples from the majority class when trained on class-imbalanced datasets. To address this issue, the Balancing GAN with gradient penalty (BAGAN-GP) has been proposed, but its outputs may still exhibit a bias toward the majority categories when images from different categories are highly similar. In this study, we introduce a novel approach called the Pre-trained Gated Variational Autoencoder with Self-attention for Balancing Generative Adversarial Network (SGBGAN) as an image-augmentation technique for generating high-quality images. The proposed method uses a Gated Variational Autoencoder with Self-attention (SA-GVAE) to initialize the GAN and transfers the pre-trained SA-GVAE weights to the GAN. Our experimental results on Fashion-MNIST, CIFAR-10, and a highly unbalanced medical image dataset demonstrate that SGBGAN outperforms other state-of-the-art methods. Results on Fréchet inception distance (FID) and structural similarity (SSIM) measures show that our model overcomes the instability problems that exist in other GANs. In particular, on the Cells dataset the FID of a minority class improves by up to 23.09% compared to the latest BAGAN-GP, and the SSIM of a minority class increases by up to 10.81%. These results show that SGBGAN overcomes the class-imbalance restriction and generates high-quality minority-class images.
Graphical abstract
The diagram provides an overview of the technical approach employed in this research paper. To address the issue of class imbalance within the dataset, a novel technique called the Gated Variational Autoencoder with Self-attention (SA-GVAE) is proposed. This SA-GVAE is utilized to initialize the Generative Adversarial Network (GAN), with the pre-trained weights from SA-GVAE being transferred to the GAN. Consequently, a Pre-trained Gated Variational Autoencoder with Self-attention for Balancing GAN (SGBGAN) is formed, serving as an image augmentation tool to generate high-quality images. Ultimately, the generation of minority samples is employed to restore class balance within the dataset.
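A minimal sketch of the weight-transfer step described above, assuming the GAN generator mirrors the pre-trained SA-GVAE decoder layer-for-layer; the module layout and parameter names are hypothetical.

```python
import torch.nn as nn

def transfer_decoder_to_generator(vae_decoder: nn.Module,
                                  generator: nn.Module) -> list:
    """Copy every parameter/buffer whose name and shape match between the
    pre-trained decoder and the generator (assumed to share a layout)."""
    gen_state = generator.state_dict()
    matched = {k: v for k, v in vae_decoder.state_dict().items()
               if k in gen_state and v.shape == gen_state[k].shape}
    gen_state.update(matched)
    generator.load_state_dict(gen_state)
    return sorted(matched)      # report which layers were initialized
```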
{"title":"SGBGAN: minority class image generation for class-imbalanced datasets","authors":"Qian Wan, Wenhui Guo, Yanjiang Wang","doi":"10.1007/s00138-023-01506-y","DOIUrl":"https://doi.org/10.1007/s00138-023-01506-y","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Abstract</h3><p>Class imbalance frequently arises in the context of image classification. Conventional generative adversarial networks (GANs) have a tendency to produce samples from the majority class when trained on class-imbalanced datasets. To address this issue, the Balancing GAN with gradient penalty (BAGAN-GP) has been proposed, but the outcomes may still exhibit a bias toward the majority categories when the similarity between images from different categories is substantial. In this study, we introduce a novel approach called the Pre-trained Gated Variational Autoencoder with Self-attention for Balancing Generative Adversarial Network (SGBGAN) as an image augmentation technique for generating high-quality images. The proposed method utilizes a Gated Variational Autoencoder with Self-attention (SA-GVAE) to initialize the GAN and transfers pre-trained SA-GVAE weights to the GAN. Our experimental results on Fashion-MNIST, CIFAR-10, and a highly unbalanced medical image dataset demonstrate that the SGBGAN outperforms other state-of-the-art methods. Results on Fréchet inception distance (FID) and structural similarity measures (SSIM) show that our model overcomes the instability problems that exist in other GANs. Especially on the Cells dataset, the FID of a minority class increases up to 23.09% compared to the latest BAGAN-GP, and the SSIM of a minority class increases up to 10.81%. It is proved that SGBGAN overcomes the class imbalance restriction and generates high-quality minority class images.\u0000</p><h3 data-test=\"abstract-sub-heading\">Graphical abstract</h3><p>The diagram provides an overview of the technical approach employed in this research paper. To address the issue of class imbalance within the dataset, a novel technique called the Gated Variational Autoencoder with Self-attention (SA-GVAE) is proposed. This SA-GVAE is utilized to initialize the Generative Adversarial Network (GAN), with the pre-trained weights from SA-GVAE being transferred to the GAN. Consequently, a Pre-trained Gated Variational Autoencoder with Self-attention for Balancing GAN (SGBGAN) is formed, serving as an image augmentation tool to generate high-quality images. Ultimately, the generation of minority samples is employed to restore class balance within the dataset.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"200 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139649209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tackling confusion among actions for action segmentation with adaptive margin and energy-driven refinement
Pub Date: 2024-01-27 | DOI: 10.1007/s00138-023-01505-z
Zhichao Ma, Kan Li
Video action segmentation is a crucial task for evaluating the ability to understand human activities. Previous works on this task focus mainly on capturing complex temporal structures and fail to consider the feature ambiguity among similar actions and the bias of training sets; as a result, they easily confuse certain actions. In this paper, we propose a novel action segmentation framework, called DeConfuNet, to solve this issue. First, we design a discriminative enhancement module (DEM) trained by adaptive margin-guided discriminative feature learning, which adjusts the margin adaptively to increase feature distinguishability among similar actions, and whose multi-stage reasoning and adaptive feature-fusion structures provide structural advantages for distinguishing similar actions. Second, we propose an equalizing influence module (EIM) that overcomes the impact of biased training sets by balancing the influence of training samples under a coefficient-adaptive loss function. Third, an energy- and context-driven refinement module (ECRM) further alleviates the impact of unbalanced training-sample influence by fusing and refining the inferences of the DEM and EIM; it uses phased predictions, including context and energy clues, to assimilate untrustworthy segments, greatly alleviating over-segmentation. Extensive experiments show the effectiveness of each proposed technique and verify that the DEM and EIM are complementary in reasoning and cooperate to overcome the confusion issue; our approach achieves significant improvements and state-of-the-art accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.
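To make the adaptive-margin idea concrete, here is a hedged sketch of a margin-augmented cross-entropy in which the margin grows for classes that are easily confused with others; the paper's actual margin rule and module structure are not reproduced here, so the formula below is illustrative only.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_ce(logits: torch.Tensor,
                       target: torch.Tensor,
                       class_sim: torch.Tensor,
                       base_margin: float = 0.2) -> torch.Tensor:
    """
    logits: (N, K) classifier outputs; target: (N,) class ids.
    class_sim: (K, K) similarity between class prototypes. Classes whose
    prototypes are close to others get a larger margin, pushing their
    features apart. Hypothetical rule, not DeConfuNet's exact one.
    """
    eye = torch.eye(class_sim.size(0), device=class_sim.device)
    sim = class_sim - eye                                    # ignore diagonal
    margin = base_margin * (1.0 + sim.max(dim=1).values)     # (K,)
    adjusted = logits.clone()
    idx = torch.arange(logits.size(0), device=logits.device)
    adjusted[idx, target] -= margin[target]                  # harder target
    return F.cross_entropy(adjusted, target)
```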
{"title":"Tackling confusion among actions for action segmentation with adaptive margin and energy-driven refinement","authors":"Zhichao Ma, Kan Li","doi":"10.1007/s00138-023-01505-z","DOIUrl":"https://doi.org/10.1007/s00138-023-01505-z","url":null,"abstract":"<p>Video action segmentation is a crucial task in evaluating the ability to understand human activities. Previous works on this task mainly focus on capturing complex temporal structures and fail to consider the feature ambiguity among similar actions and the biased training sets, thus they are easy to confuse some actions. In this paper, we propose a novel action segmentation framework, called DeConfuNet, to solve the above issue. First, we design a discriminative enhancement module (DEM) trained by an adaptive margin-guided discriminative feature learning which adjusts the margin adaptively to increase the feature distinguishability among similar actions, and whose multi-stage reasoning and adaptive feature fusion structures provide structural advantages for distinguishing similar actions. Second, we propose an equalizing influence module (EIM) that can overcome the impact of biased training sets by balancing the influence of training samples under a coefficient-adaptive loss function. Third, an energy and context-driven refinement module (ECRM) further alleviates the impact of the unbalanced influence of training samples by fusing and refining the inference of DEM and EIM, which utilizes the phased prediction including context and energy clues to assimilate untrustworthy segments, alleviating over-segmentation hugely. Extensive experiments show the effectiveness of each proposed technique, they verify that the DEM and EIM are complementary in reasoning and cooperate to overcome the confusion issue, and our approach achieves significant improvement and state-of-the-art performance of accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"1 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139582210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimization model based on attention mechanism for few-shot image classification
Pub Date: 2024-01-19 | DOI: 10.1007/s00138-023-01502-2
Ruizhi Liao, Junhai Zhai, Feng Zhang
Deep learning has emerged as the leading approach for pattern recognition, but its reliance on large labeled datasets poses challenges in real-world applications where obtaining annotated samples is difficult. Few-shot learning, inspired by human learning, enables fast adaptation to new concepts from limited examples. Optimization-based meta-learning has gained popularity as a few-shot learning method. However, it struggles to capture long-range dependencies of gradients and converges slowly, making it difficult to extract features from limited samples. To overcome these issues, we propose MLAL, an attention-based optimization model for few-shot learning. The model comprises two parts: an attention-LSTM meta-learner, which optimizes gradients hierarchically using the self-attention mechanism, and a cross-attention base-learner, which uses the cross-attention mechanism to jointly learn the common category features of the support and query sets in a meta-task. Extensive experiments on two benchmark datasets show that MLAL achieves exceptional 1-shot and 5-shot classification accuracy on MiniImagenet and TieredImagenet. The code for our proposed method is available at https://github.com/wflrz123/MLAL.
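A minimal sketch of scaled dot-product cross-attention between support and query features, the mechanism the cross-attention base-learner builds on; shapes and the absence of learned projections are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def cross_attention(query_feats: torch.Tensor,
                    support_feats: torch.Tensor) -> torch.Tensor:
    """
    query_feats: (Nq, D) query-set features; support_feats: (Ns, D)
    support-set features. Each query feature attends over the support set,
    emphasizing structure shared between the two (the 'common category
    features' the abstract mentions).
    """
    d = query_feats.size(-1)
    scores = query_feats @ support_feats.t() / d ** 0.5   # (Nq, Ns)
    weights = F.softmax(scores, dim=-1)
    return weights @ support_feats                        # (Nq, D)
```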
{"title":"Optimization model based on attention mechanism for few-shot image classification","authors":"Ruizhi Liao, Junhai Zhai, Feng Zhang","doi":"10.1007/s00138-023-01502-2","DOIUrl":"https://doi.org/10.1007/s00138-023-01502-2","url":null,"abstract":"<p>Deep learning has emerged as the leading approach for pattern recognition, but its reliance on large labeled datasets poses challenges in real-world applications where obtaining annotated samples is difficult. Few-shot learning, inspired by human learning, enables fast adaptation to new concepts with limited examples. Optimization-based meta-learning has gained popularity as a few-shot learning method. However, it struggles with capturing long-range dependencies of gradients and has slow convergence rates, making it challenging to extract features from limited samples. To overcome these issues, we propose MLAL, an optimization model based on attention for few-shot learning. The model comprises two parts: the attention-LSTM meta-learner, which optimizes gradients hierarchically using the self-attention mechanism, and the cross-attention base-learner, which uses the cross-attention mechanism to cross-learn the common category features of support and query sets in a meta-task. Extensive experiments on two benchmark datasets show that MLAL achieves exceptional 1-shot and 5-shot classification accuracy on MiniImagenet and TiredImagenet. The codes for our proposed method are available at https://github.com/wflrz123/MLAL.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"19 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Obs-tackle: an obstacle detection system to assist navigation of visually impaired using smartphones
Pub Date: 2024-01-19 | DOI: 10.1007/s00138-023-01499-8
U. Vijetha, V. Geetha
As the prevalence of vision impairment continues to rise worldwide, there is an increasing need for affordable and accessible solutions that improve the daily experiences of individuals with vision impairment. The visually impaired (VI) are prone to falls and injuries because they cannot recognize dangers on the path while navigating, so it is crucial that they be made aware of potential hazards in both known and unknown environments. Obstacle detection plays a key role in navigation-assistance solutions for VI users, and experimentation on it has surged since the introduction of autonomous navigation in automobiles, robots, and drones. Previously, auditory, laser, and depth sensors dominated obstacle detection; advances in computer vision and deep learning, however, have enabled it with simpler tools such as smartphone cameras. While previous approaches to obstacle detection using estimated depth data have been effective, they suffer from limitations such as reduced accuracy when adapted for edge devices and an inability to identify objects in the scene. To address these limitations, we propose an indoor and outdoor obstacle detection and identification technique that combines semantic segmentation with depth-estimation data. We hypothesize that this combination enhances obstacle detection and identification compared to using depth data alone. To evaluate the effectiveness of the proposed method, we validated it against ground-truth obstacle data derived from the DIODE and NYU Depth v2 datasets. Our experimental results demonstrate that the proposed method achieves nearly 85% accuracy in detecting nearby obstacles, with low false-positive and false-negative rates. A demonstration of the system deployed as an Android app, 'Obs-tackle', is available at https://youtu.be/PSn-FEc5EQg?si=qPGB13tkYkD1kSOf.
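A hedged sketch of how a segmentation map and an estimated depth map can be fused to flag nearby obstacles; the distance threshold, walking-path region, and class ids below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def nearby_obstacles(seg: np.ndarray, depth: np.ndarray,
                     max_dist_m: float = 1.5,
                     floor_ids: tuple = (0,)) -> list:
    """
    seg:   (H, W) int array of semantic class ids from a segmentation model.
    depth: (H, W) float array of estimated metric depth, same resolution.
    Returns class ids appearing closer than max_dist_m in the lower half of
    the frame, where the user's walking path lies.
    """
    h = seg.shape[0]
    path = np.zeros_like(seg, dtype=bool)
    path[h // 2:, :] = True                      # restrict to walking path
    close = (depth < max_dist_m) & path
    ids, counts = np.unique(seg[close], return_counts=True)
    # ignore the floor class and tiny blobs that are likely depth noise
    return [int(i) for i, c in zip(ids, counts)
            if i not in floor_ids and c > 0.001 * seg.size]
```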
{"title":"Obs-tackle: an obstacle detection system to assist navigation of visually impaired using smartphones","authors":"U. Vijetha, V. Geetha","doi":"10.1007/s00138-023-01499-8","DOIUrl":"https://doi.org/10.1007/s00138-023-01499-8","url":null,"abstract":"<p>As the prevalence of vision impairment continues to rise worldwide, there is an increasing need for affordable and accessible solutions that improve the daily experiences of individuals with vision impairment. The Visually Impaired (VI) are often prone to falls and injuries due to their inability to recognize dangers on the path while navigating. It is therefore crucial that they are aware of potential hazards in both known and unknown environments. Obstacle detection plays a key role in navigation assistance solutions for VI users. There has been a surge in experimentation on obstacle detection since the introduction of autonomous navigation in automobiles, robots, and drones. Previously, auditory, laser, and depth sensors dominated obstacle detection; however, advances in computer vision and deep learning have enabled it using simpler tools like smartphone cameras. While previous approaches to obstacle detection using estimated depth data have been effective, they suffer from limitations such as compromised accuracy when adapted for edge devices and the incapability to identify objects in the scene. To address these limitations, we propose an indoor and outdoor obstacle detection and identification technique that combines semantic segmentation with depth estimation data. We hypothesize that this combination of techniques will enhance obstacle detection and identification compared to using depth data alone. To evaluate the effectiveness of our proposed Obstacle detection method, we validated it against ground truth Obstacle data derived from the DIODE and NYU Depth v2 dataset. Our experimental results demonstrate that the proposed method achieves near 85% accuracy in detecting nearby obstacles with lower false positive and false negative rates. The demonstration of the proposed system deployed as an Android app-‘Obs-tackle’ is available at https://youtu.be/PSn-FEc5EQg?si=qPGB13tkYkD1kSOf.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"70 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-view spectral clustering based on constrained Laplacian rank
Pub Date: 2024-01-12 | DOI: 10.1007/s00138-023-01497-w
Jinmei Song, Baokai Liu, Yao Yu, Kaiwu Zhang, Shiqiang Du
The graph-based approach is a representative clustering method among multi-view clustering algorithms. However, it remains challenging to quickly acquire complementary information in multi-view data and to cluster effectively when the quality of the initially constructed data graph is inadequate. We therefore propose CLRSC, a new graph-based method for multi-view spectral clustering based on constrained Laplacian rank (CLR). Our contributions are: (1) Self-representation learning and CLR are extended to the multi-view setting and connected in a unified framework to learn a common affinity matrix. (2) To achieve overall optimization, we construct a graph-learning method based on constrained Laplacian rank and combine it with spectral clustering. (3) We design an iterative optimization-based procedure and show that the algorithm converges. Finally, extensive experiments are carried out on 5 benchmark datasets. The results on the MSRC-v1 and BBCSport datasets show that the accuracy (ACC) of the method is 10.95% and 4.61% higher, respectively, than that of the best comparison algorithm.
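The constrained-Laplacian-rank idea rests on a standard spectral fact: the Laplacian of an affinity graph with exactly c connected components has rank n - c, so c clusters can be read off directly. The sketch below only verifies that property for a given affinity matrix; CLRSC optimizes the affinity matrix under this rank constraint rather than merely checking it.

```python
import numpy as np

def cluster_count_from_affinity(S: np.ndarray, tol: float = 1e-8) -> int:
    """Count connected components of the graph with affinity matrix S by
    counting (near-)zero eigenvalues of its unnormalized Laplacian."""
    S = (S + S.T) / 2.0                       # symmetrize
    L = np.diag(S.sum(axis=1)) - S            # L = D - S, positive semidefinite
    eigvals = np.linalg.eigvalsh(L)           # ascending eigenvalues
    return int(np.sum(eigvals < tol))         # rank(L) = n - c  =>  c zeros
```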
{"title":"Multi-view spectral clustering based on constrained Laplacian rank","authors":"Jinmei Song, Baokai Liu, Yao Yu, Kaiwu Zhang, Shiqiang Du","doi":"10.1007/s00138-023-01497-w","DOIUrl":"https://doi.org/10.1007/s00138-023-01497-w","url":null,"abstract":"<p>The graph-based approach is a representative clustering method among multi-view clustering algorithms. However, it remains a challenge to quickly acquire complementary information in multi-view data and to execute effective clustering when the quality of the initially constructed data graph is inadequate. Therefore, we propose multi-view spectral clustering based on constrained Laplacian rank method, a new graph-based method (CLRSC). The following are our contributions: (1) Self-representation learning and CLR are extended to multi-view and they are connected into a unified framework to learn a common affinity matrix. (2) To achieve the overall optimization we construct a graph learning method based on constrained Laplacian rank and combine it with spectral clustering. (3) An iterative optimization-based procedure we designed and showed that our algorithm is convergent. Finally, sufficient experiments are carried out on 5 benchmark datasets. The experimental results on MSRC-v1 and BBCSport datasets show that the accuracy (ACC) of the method is 10.95% and 4.61% higher than the optimal comparison algorithm, respectively.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"12 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-accuracy 3D locators tracking in real time using monocular vision
Pub Date: 2024-01-11 | DOI: 10.1007/s00138-023-01498-9
C. Elmo Kulanesan, P. Vacher, L. Charleux, E. Roux
In the field of medical applications, precise localization of medical instruments and bone structures is crucial for computer-assisted surgical interventions. In orthopedic surgery, existing devices typically rely on stereoscopic vision; their purpose is to aid the surgeon in screw fixation of prostheses or in bone removal. This article addresses the challenge of localizing a rigid object covered with randomly arranged planar markers using a single camera. This approach is especially important in medical situations where accurate object alignment relative to a camera is necessary at distances ranging from 80 cm to 120 cm. In addition, limiting the locator's size to a few tens of centimeters ensures that it does not obstruct the work area. The rigid locator consists of a solid onto whose surface a set of planar markers (ArUco) is glued. The markers are randomly distributed over the surface so that at least two are visible whatever the orientation of the locator. Calibrating the locator involves finding the relative positions of the individual planar elements and is based on a bundle-adjustment approach. One of the main known difficulties associated with planar markers is pose ambiguity. To solve this problem, our method formulates an efficient initial solution for the optimization step. After calibration, the positioning uncertainties of the locator are better than two-tenths of a cubic millimeter and one-tenth of a degree, regardless of the locator's orientation in space. To assess the proposed method, the locator is rigidly attached to a stylus about twenty centimeters long. With this approach, the tip of the stylus, seen by a 16.1-megapixel camera at a distance of about 1 m, is localized in real time within a cube of less than 1 mm per side. A surface-registration application is demonstrated by using the stylus on an artificial scapula.
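A minimal monocular sketch of the detection-plus-pose step using OpenCV's ArUco module. The API shown is OpenCV's (4.7+; older versions use cv2.aruco.detectMarkers directly), and pooling all visible markers into one SQPnP solve is an illustrative choice, not the paper's ambiguity-resolution method; `marker_corners_3d` stands in for the calibrated marker geometry from the bundle-adjustment step.

```python
import cv2
import numpy as np

def locate(frame, K, dist, marker_corners_3d):
    """
    frame: camera image; K, dist: intrinsics and distortion coefficients.
    marker_corners_3d: dict mapping marker id -> (4, 3) corner coordinates
    in the locator frame (from calibration). Returns (rvec, tvec) or None.
    """
    aruco = cv2.aruco
    detector = aruco.ArucoDetector(
        aruco.getPredefinedDictionary(aruco.DICT_4X4_50),
        aruco.DetectorParameters())
    corners, ids, _ = detector.detectMarkers(frame)
    if ids is None:
        return None
    obj_pts, img_pts = [], []
    for c, i in zip(corners, ids.flatten()):     # pool every visible marker
        if int(i) in marker_corners_3d:
            obj_pts.append(marker_corners_3d[int(i)])
            img_pts.append(c.reshape(4, 2))
    if not obj_pts:
        return None
    ok, rvec, tvec = cv2.solvePnP(
        np.concatenate(obj_pts).astype(np.float32),
        np.concatenate(img_pts).astype(np.float32),
        K, dist, flags=cv2.SOLVEPNP_SQPNP)
    return (rvec, tvec) if ok else None
```

Pooling corners from several non-coplanar markers is what makes a single-camera pose well conditioned here: a lone planar marker is exactly the configuration where pose ambiguity bites.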
{"title":"High-accuracy 3D locators tracking in real time using monocular vision","authors":"C. Elmo Kulanesan, P. Vacher, L. Charleux, E. Roux","doi":"10.1007/s00138-023-01498-9","DOIUrl":"https://doi.org/10.1007/s00138-023-01498-9","url":null,"abstract":"<p>In the field of medical applications, precise localization of medical instruments and bone structures is crucial to ensure computer-assisted surgical interventions. In orthopedic surgery, existing devices typically rely on stereoscopic vision. Their purpose is to aid the surgeon in screw fixation of prostheses or bone removal. This article addresses the challenge of localizing a rigid object consisting of randomly arranged planar markers using a single camera. This approach is especially vital in medical situations where accurate object alignment relative to a camera is necessary at distances ranging from 80 cm to 120 cm. In addition, the size limitation of a few tens of centimeters ensures that the resulting locator does not obstruct the work area. This rigid locator consists of a solid at the surface of which a set of plane markers (ArUco) are glued. These plane markers are randomly distributed over the surface in order to systematically have a minimum of two visible markers whatever the orientation of the locator. The calibration of the locator involves finding the relative positions between the individual planar elements and is based on a bundle adjustment approach. One of the main and known difficulties associated with planar markers is the problem of pose ambiguity. To solve this problem, our method lies in the formulation of an efficient initial solution for the optimization step. After the calibration step, the reached positioning uncertainties of the locator are better than two-tenth of a cubic millimeter and one-tenth of a degree, regardless of the orientation of the locator in space. To assess the proposed method, the locator is rigidly attached to a stylus of about twenty centimeters length. Thanks to this approach, the tip of this stylus seen by a 16.1 megapixel camera at a distance of about 1 m is localized in real time in a cube lower than 1 mm side. A surface registration application is proposed by using the stylus on an artificial scapula.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"129 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Local region-learning modules for point cloud classification
Pub Date: 2023-12-21 | DOI: 10.1007/s00138-023-01495-y
Kaya Turgut, Helin Dutagaci
Data organization via the formation of local regions is an integral part of deep learning networks that process 3D point clouds hierarchically. At each level, the point cloud is sampled to extract representative points, and these points serve as the centers of local regions. The organization of local regions is of considerable importance, since it determines the location and size of the receptive field at a particular layer of feature aggregation. In this paper, we present two local region-learning modules: a Center Shift Module that infers the appropriate shift for each center point, and a Radius Update Module that alters the radius of each local region. The parameters of the modules are learned by optimizing the loss associated with the task within an end-to-end network. We present alternatives for these modules based on various ways of modeling the interactions between the features and locations of 3D points in the point cloud. We integrated both modules, independently and together, into the PointNet++ and PointCNN object classification architectures, and demonstrated that they contribute to a significant increase in classification accuracy on the ScanObjectNN dataset, which consists of scans of real-world objects. Further experiments on the ShapeNet dataset showed that the modules are also effective on 3D CAD models.
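A hedged sketch of the Center Shift Module idea: a small MLP predicts a bounded 3D offset for each sampled center from its features, moving the local-region center before grouping. Layer sizes and the shift bound are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class CenterShiftModule(nn.Module):
    """Predict a per-center 3D shift from point features, trained end-to-end
    with the task loss. A generic stand-in for the paper's module."""
    def __init__(self, feat_dim: int, hidden: int = 64, max_shift: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 3))
        self.max_shift = max_shift          # keep shifts local to the region

    def forward(self, centers: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # centers: (B, M, 3) sampled points; feats: (B, M, F) their features
        shift = torch.tanh(self.mlp(feats)) * self.max_shift
        return centers + shift              # shifted local-region centers
```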
{"title":"Local region-learning modules for point cloud classification","authors":"Kaya Turgut, Helin Dutagaci","doi":"10.1007/s00138-023-01495-y","DOIUrl":"https://doi.org/10.1007/s00138-023-01495-y","url":null,"abstract":"<p>Data organization via forming local regions is an integral part of deep learning networks that process 3D point clouds in a hierarchical manner. At each level, the point cloud is sampled to extract representative points and these points are used to be centers of local regions. The organization of local regions is of considerable importance since it determines the location and size of the receptive field at a particular layer of feature aggregation. In this paper, we present two local region-learning modules: Center Shift Module to infer the appropriate shift for each center point, and Radius Update Module to alter the radius of each local region. The parameters of the modules are learned through optimizing the loss associated with the particular task within an end-to-end network. We present alternatives for these modules through various ways of modeling the interactions of the features and locations of 3D points in the point cloud. We integrated both modules independently and together to the PointNet++ and PointCNN object classification architectures, and demonstrated that the modules contributed to a significant increase in classification accuracy for the ScanObjectNN data set consisting of scans of real-world objects. Our further experiments on ShapeNet data set showed that the modules are also effective on 3D CAD models.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"307 5 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138823537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}