
Machine Vision and Applications: Latest Publications

YOLOMH: you only look once for multi-task driving perception with high efficiency
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-29 · DOI: 10.1007/s00138-024-01525-3
Liu Fang, Sun Bowen, Jianxi Miao, Weixing Su
Citations: 0
Ssman: self-supervised masked adaptive network for 3D human pose estimation
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-27 · DOI: 10.1007/s00138-024-01514-6

Abstract

Modern deep learning-based models for 3D human pose estimation from monocular images typically cannot adapt between occlusion and non-occlusion scenarios, which can limit their performance under varying degrees of occlusion. To tackle this problem, we propose a novel network called the self-supervised masked adaptive network (SSMAN). First, we apply masks of different levels to cover the range of occlusions found in fully in-the-wild environments. We then design a multi-line adaptive network that can be trained on images with various mask scales in parallel. We train this masked adaptive network with self-supervised learning to enforce consistency across its outputs under different mask ratios. Furthermore, a global refinement module leverages global features of the human body to refine poses estimated solely from local features. We perform extensive experiments on occlusion datasets such as 3DPW-OCC and OCHuman as well as general datasets such as Human3.6M and 3DPW. The results show that SSMAN achieves new state-of-the-art performance on both lightly and heavily occluded benchmarks and remains highly competitive, with significant improvements, on standard benchmarks.
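The masking-plus-consistency recipe above translates naturally into a short training step. The PyTorch sketch below is a minimal illustration of that idea, not the paper's implementation: the patch size, mask ratios, and the `pose_net` module are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def random_mask(images, ratio):
    # Zero out a random fraction of 16x16 patches (hypothetical masking scheme).
    b, c, h, w = images.shape
    keep = (torch.rand(b, 1, h // 16, w // 16, device=images.device) > ratio).float()
    return images * F.interpolate(keep, size=(h, w), mode="nearest")

def consistency_step(pose_net, images, mask_ratios=(0.25, 0.5, 0.75)):
    # Pull predictions on masked views toward the prediction on the clean view,
    # enforcing consistency across mask ratios.
    with torch.no_grad():
        target = pose_net(images)                 # pose from the unmasked view
    loss = sum(F.mse_loss(pose_net(random_mask(images, r)), target)
               for r in mask_ratios)
    return loss / len(mask_ratios)
```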

Citations: 0
Kernel based local matching network for video object segmentation
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-25 · DOI: 10.1007/s00138-024-01524-4
Guoqiang Wang, Lan Li, Min Zhu, Rui Zhao, Xiang Zhang

Recently, methods based on space-time memory networks have achieved strong performance in semi-supervised video object segmentation and attracted wide attention. However, these methods still have a critical limitation: the non-local matching they rely on causes interference from similar objects, which severely limits segmentation performance. To solve this problem, we propose a Kernel-guided Attention Matching Network (KAMNet) that uses local matching instead of non-local matching. First, KAMNet uses a spatio-temporal attention mechanism to sharpen the model's discrimination between foreground objects and background areas. KAMNet then uses a Gaussian kernel to guide the matching between the current frame and the reference set. Because the Gaussian kernel decays away from the center, it restricts matching to the central region, thus achieving local matching. Our KAMNet achieves a speed-accuracy trade-off on the benchmark datasets DAVIS 2016 ($\mathcal{J}\&\mathcal{F}$ of 87.6%) and DAVIS 2017 ($\mathcal{J}\&\mathcal{F}$ of 76.0%) at 0.12 seconds per frame.
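The Gaussian-kernel idea admits a compact reading: compute the usual non-local similarity, then bias it by a spatial Gaussian before the softmax, which is equivalent to multiplying each query's attention weights by a kernel centred on its location. The PyTorch sketch below illustrates this under assumed feature shapes; `sigma` and the scaling are hypothetical choices, not KAMNet's exact formulation.

```python
import torch

def gaussian_local_matching(feat_cur, feat_ref, sigma=8.0):
    # feat_cur, feat_ref: (C, H, W) feature maps of the current and reference
    # frames; sigma is an assumed bandwidth.
    c, h, w = feat_cur.shape
    q = feat_cur.flatten(1).t()                    # (HW, C) queries
    k = feat_ref.flatten(1)                        # (C, HW) keys
    sim = (q @ k) / c ** 0.5                       # non-local similarity
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()  # (HW, 2)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)         # sq. dist
    # Subtracting d2 / (2 sigma^2) inside the softmax multiplies each row of
    # attention by a Gaussian centred at the query pixel, so distant matches
    # decay and only the local neighbourhood contributes.
    attn = torch.softmax(sim - d2 / (2 * sigma ** 2), dim=1)        # (HW, HW)
    return (attn @ k.t()).t().reshape(c, h, w)     # locally matched readout
```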

Citations: 0
Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-23 · DOI: 10.1007/s00138-024-01510-w
David Bojanić, Kristijan Bartol, Josep Forest, Tomislav Petković, Tomislav Pribanić

Most recent 3D registration methods are learning-based: they either find correspondences in feature space and match them, or directly estimate the registration transformation from the given point cloud features. These feature-based methods therefore have difficulty generalizing to point clouds that differ substantially from their training data. The issue is not readily apparent because of problematic benchmark definitions, which cannot support in-depth analysis and are biased toward similar data. We therefore propose a methodology that, given a point cloud dataset, creates a 3D registration benchmark providing a more informative evaluation of a method than other benchmarks. Using this methodology, we create a novel FAUST-partial (FP) benchmark, based on the FAUST dataset, with several difficulty levels. The FP benchmark addresses the limitations of current benchmarks, namely the lack of data and parameter-range variability, and allows the strengths and weaknesses of a 3D registration method to be evaluated with respect to a single registration parameter. Using the new FP benchmark, we provide a thorough analysis of the current state-of-the-art methods and observe that they still struggle to generalize to severely different out-of-sample data. We therefore propose a simple featureless traditional 3D registration baseline based on the weighted cross-correlation between two given point clouds. Our method achieves strong results on current benchmark datasets, outperforming most deep learning methods. Our source code is available at github.com/DavidBoja/exhaustive-grid-search.
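The featureless baseline is described as a weighted cross-correlation between the two clouds, searched exhaustively. A stripped-down version of that idea, translation-only and using plain binary occupancy grids rather than the paper's weighting, might look as follows; the voxel size is arbitrary, and rotations would be handled by looping over candidate angles.

```python
import numpy as np
from scipy.signal import fftconvolve

def voxelize(points, origin, shape, voxel):
    # Binary occupancy grid for an (N, 3) point cloud.
    idx = np.floor((points - origin) / voxel).astype(int)
    keep = ((idx >= 0) & (idx < shape)).all(axis=1)
    grid = np.zeros(shape, dtype=np.float32)
    grid[tuple(idx[keep].T)] = 1.0
    return grid

def best_translation(src, tgt, voxel=0.05):
    # Translation maximizing the cross-correlation of the two occupancy grids,
    # computed as a convolution with the flipped source grid.
    lo = np.minimum(src.min(0), tgt.min(0))
    hi = np.maximum(src.max(0), tgt.max(0))
    shape = np.ceil((hi - lo) / voxel).astype(int) + 1
    g_src = voxelize(src, lo, shape, voxel)
    g_tgt = voxelize(tgt, lo, shape, voxel)
    corr = fftconvolve(g_tgt, g_src[::-1, ::-1, ::-1], mode="full")
    peak = np.array(np.unravel_index(corr.argmax(), corr.shape))
    return (peak - (np.array(g_src.shape) - 1)) * voxel  # metric offset
```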

Citations: 0
AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-23 · DOI: 10.1007/s00138-024-01509-3
Bingli Zhang, Yixin Wang, Chengbiao Zhang, Junzhao Jiang, Zehao Pan, Jin Cheng, Yangyang Zhang, Xinyu Wang, Chenglei Yang, Yanhui Wang

Lidar and cameras are essential sensors for environment perception in autonomous driving. However, fully fusing heterogeneous data from multiple sources remains a non-trivial challenge. As a result, 3D object detection based on multi-modal sensor fusion is often inferior to single-modal methods based on Lidar alone, which indicates that multi-sensor machine vision still needs development. In this paper, we propose an adaptive fusion module based on a cross-modal transformer block (AFMCT) for 3D object detection, using a bidirectional enhancing strategy. Specifically, we first enhance the image features by extracting attention-based point features with a cross-modal transformer block and linking the two by concatenation; a second cross-modal transformer block then acts on the enhanced image features to strengthen the point features with image semantic information. Extensive experiments on the 3D detection benchmark of the KITTI dataset reveal that our proposed structure significantly improves the detection accuracy of Lidar-only methods and outperforms existing advanced multi-sensor fusion modules by at least 0.45%, which indicates that our method may be a feasible way to improve 3D object detection based on multi-sensor fusion.
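One plausible reading of the bidirectional enhancing strategy is a pair of cross-attention blocks, each taking queries from one modality and keys/values from the other. The PyTorch sketch below illustrates that pattern; the dimensions, token counts, and the exact linking of the two passes are assumptions, not the AFMCT architecture itself.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    # Queries come from one modality; keys and values from the other.
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, context):
        fused, _ = self.attn(queries, context, context)
        return self.norm(queries + fused)          # residual + norm

dim = 128
img_tokens = torch.randn(2, 196, dim)   # flattened image feature map (assumed)
pt_tokens = torch.randn(2, 512, dim)    # projected per-point features (assumed)
img_branch, pt_branch = CrossModalBlock(dim), CrossModalBlock(dim)
# First pass: enhance image tokens with attention-based point features;
# second pass: strengthen point tokens with the enhanced image semantics.
img_enhanced = img_branch(img_tokens, pt_tokens)
pt_enhanced = pt_branch(pt_tokens, img_enhanced)
```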

Citations: 0
Hyperspectral image dynamic range reconstruction using deep neural network-based denoising methods
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-22 · DOI: 10.1007/s00138-024-01523-5
Loran Cheplanov, Shai Avidan, David J. Bonfil, Iftach Klapp

Hyperspectral (HS) measurement is among the most useful tools in agriculture for early disease detection. However, the cost of HS cameras that can perform the desired detection tasks is prohibitive: typically fifty thousand to hundreds of thousands of dollars. In a previous study at the Agricultural Research Organization's Volcani Institute (Israel), a low-cost, high-performing HS system was developed that included a point spectrometer and optical components. Its main disadvantage was a long shooting time for each image. Shooting time strongly depends on the predetermined integration time of the point spectrometer. While essential for performing monitoring tasks in a reasonable time, shortening the integration time from a typical value of around 200 ms to the 10 ms range degrades the dynamic range of the captured scene. In this work, we suggest correcting this by learning the transformation from data measured with a short integration time to data measured with a long integration time. The reduced dynamic range and consequent low SNR were successfully overcome using three deep neural network models built on denoising auto-encoder, DnCNN, and LambdaNetworks backbones. The best model was based on DnCNN, using a loss function combining $\ell_2$ and Kullback–Leibler divergence on images with 20 consecutive channels. Over the full spectrum, the model achieved a mean PSNR of 30.61 and a mean SSIM of 0.9, improving on the 10 ms measurements' mean PSNR and mean SSIM by 60.43% and 94.51%, respectively.
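The combined loss is concrete enough to sketch: an $\ell_2$ term plus a Kullback–Leibler term between prediction and target treated as normalized intensity distributions. The weighting `alpha` and the normalization scheme below are hypothetical choices for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, alpha=0.5, eps=1e-8):
    # pred, target: (B, C, H, W) reconstructed and long-integration images.
    l2 = F.mse_loss(pred, target)
    # Treat each image as a distribution over pixels for the KL term.
    p = pred.clamp_min(0).flatten(1)
    q = target.clamp_min(0).flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps) + eps
    q = q / (q.sum(dim=1, keepdim=True) + eps) + eps
    kl = (q * (q / p).log()).sum(dim=1).mean()     # KL(target || prediction)
    return l2 + alpha * kl                         # alpha: assumed weighting
```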

Citations: 0
Point cloud registration with quantile assignment
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-19 · DOI: 10.1007/s00138-024-01517-3
Ecenur Oğuz, Yalım Doğan, Uğur Güdükbay, Oya Karaşan, Mustafa Pınar

Point cloud registration is a fundamental problem in computer vision. It encompasses critical tasks such as feature estimation, correspondence matching, and transformation estimation, and it can be cast as a quantile matching problem. We refined the quantile assignment algorithm by integrating prevalent feature descriptors and transformation estimation methods to enhance the correspondence between the source and target point clouds. We evaluated the performance of these descriptors and methods within our approach through controlled experiments on a dataset we constructed from well-known 3D models. This systematic investigation identified the methods best suited to complementing our approach. We then devised a new end-to-end, coarse-to-fine pairwise point cloud registration framework. Finally, we tested our framework on indoor and outdoor benchmark datasets and compared our results with state-of-the-art point cloud registration methods.
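As one illustration of quantile-based assignment (a plausible reading, not the paper's exact algorithm), correspondences can be restricted to mutual nearest neighbours whose feature distance falls below a chosen quantile of the matched distances:

```python
import numpy as np

def quantile_correspondences(feat_src, feat_tgt, q=0.1):
    # Pairwise feature distances between the two clouds' descriptors.
    d = np.linalg.norm(feat_src[:, None, :] - feat_tgt[None, :, :], axis=-1)
    nn_st = d.argmin(axis=1)                      # source -> target NN
    nn_ts = d.argmin(axis=0)                      # target -> source NN
    src_idx = np.arange(len(feat_src))
    mutual = nn_ts[nn_st] == src_idx              # mutual nearest neighbours
    dists = d[src_idx, nn_st]
    thresh = np.quantile(dists[mutual], q) if mutual.any() else np.inf
    keep = mutual & (dists <= thresh)             # q-quantile cut-off
    return np.stack([src_idx[keep], nn_st[keep]], axis=1)
```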

Citations: 0
An image quality assessment method based on edge extraction and singular value for blurriness
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-19 · DOI: 10.1007/s00138-024-01522-6
Lei Zhou, Chuanlin Liu, Amit Yadav, Sami Azam, Asif Karim

Automatic assessment of perceived image quality is crucial in image processing. To this end, we propose an image quality assessment (IQA) method for blurriness that extracts gradient and singular-value features instead of the single feature used in traditional IQA algorithms. Because existing public IQA datasets are too small to support deep learning, machine learning is used to fuse features from multiple domains, yielding a new no-reference (NR) IQA method for blurriness denoted Feature-fusion IQA (Ffu-IQA). Ffu-IQA uses a probabilistic model to estimate the probability of blur at each detected edge in the image and then aggregates the probability information with machine learning to obtain an edge quality score. It then computes a singular-value score from the singular values obtained by singular value decomposition of the image matrix. Finally, machine-learning pooling yields the true quality score. Ffu-IQA achieves PLCC scores of 0.9570 and 0.9616 on CSIQ and TID2013, respectively, and SROCC scores of 0.9380 and 0.9531, outperforming most traditional IQA methods for blurriness.
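The singular-value score rests on a simple observation: blur concentrates the image matrix's energy into its leading singular values. A toy version of such a score is sketched below; the choice of `k` and the ratio form are hypothetical, and a regressor trained on such scores together with the edge-blur probabilities would then play the role of the machine-learning pooling stage.

```python
import numpy as np

def singular_value_score(gray_image, k=10, eps=1e-12):
    # Blur shrinks the trailing singular values, so the energy share held by
    # the top-k values rises as the image gets blurrier.
    s = np.linalg.svd(gray_image.astype(np.float64), compute_uv=False)
    return s[:k].sum() / (s.sum() + eps)          # k is an arbitrary choice
```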

Citations: 0
Temporal teacher with masked transformers for semi-supervised action proposal generation
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-15 · DOI: 10.1007/s00138-024-01521-7
Selen Pehlivan, Jorma Laaksonen

By conditioning on unit-level predictions, anchor-free models for action proposal generation have displayed impressive capabilities, such as lightweight architectures. However, task performance depends heavily on the quality of the training data, and the most effective models have relied on human-annotated data. Semi-supervised learning, i.e., jointly training deep neural networks on a labeled dataset and an unlabeled dataset, has made significant progress recently. Existing works have either focused primarily on classification tasks, which may require less annotation effort, or considered anchor-based detection models. Inspired by recent advances in semi-supervised methods for anchor-free object detectors, we propose a teacher-student framework for a two-stage action detection pipeline, named Temporal Teacher with Masked Transformers (TTMT), to generate high-quality action proposals from an anchor-free transformer model. Leveraging consistency learning as a self-training technique, the model jointly trains an anchor-free student and a gradually progressing teacher counterpart in a mutually beneficial manner. As the core model, we design a Transformer-based anchor-free model to improve the effectiveness of temporal evaluation. We integrate bi-directional masks and devise encoder-only Masked Transformers for sequences. Trained jointly on boundary locations and various local snippet-based features, our model generates proposal candidates via the proposed scoring function. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our model for the temporal proposal generation task.
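A "gradually progressing teacher" is conventionally realised as an exponential moving average of the student's weights. The sketch below shows that standard mean-teacher update; the momentum value is a typical default, not taken from the paper.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Exponential moving average: the teacher trails the student smoothly.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)

# Usage: the teacher starts as a frozen copy of the student and is nudged
# toward it once per training step.
student = torch.nn.Linear(16, 4)          # stand-in for the proposal model
teacher = copy.deepcopy(student).requires_grad_(False)
ema_update(teacher, student)
```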

Citations: 0
Adversarial robustness improvement for deep neural networks
IF 3.3 · CAS Tier 4 (Computer Science) · Q2 Computer Science · Pub Date: 2024-03-14 · DOI: 10.1007/s00138-024-01519-1
Charis Eleftheriadis, Andreas Symeonidis, Panagiotis Katsaros

Deep neural networks (DNNs) are key components for implementing autonomy in systems that operate in highly complex and unpredictable environments (self-driving cars, smart traffic systems, smart manufacturing, etc.). It is well known that DNNs are vulnerable to adversarial examples: minimal and usually imperceptible perturbations applied to their inputs that lead to false predictions. This threat poses critical challenges, especially when DNNs are deployed in safety- or security-critical systems, and makes defences that can improve the trustworthiness of DNN functions an urgent need. Adversarial training has proven effective at improving the robustness of DNNs against a wide range of adversarial perturbations. However, a general framework for adversarial defences is needed that extends beyond a single-dimensional assessment of robustness improvement; several distance metrics and adversarial attack strategies must be considered simultaneously. Using such an approach, we report results from extensive experimentation on adversarial defence methods that could improve DNN resilience to adversarial threats. We conclude by introducing a general adversarial training methodology which, according to our experimental results, opens prospects for a holistic defence against a range of diverse types of adversarial perturbations.
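As background for the technique the abstract builds on, a standard adversarial training step pairs an inner attack, here $L_\infty$ PGD, with an outer update on the resulting examples. The sketch below shows the common Madry-style recipe with conventional hyperparameters; it is generic background, not the paper's specific methodology.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Standard L-infinity PGD: random start, signed-gradient steps,
    # projection back into the eps-ball and the valid pixel range.
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = x.detach() + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # Train on adversarial examples generated on the fly.
    model.train()
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```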

Citations: 0