Guanfeng Wu, Abbas Haider, Xing Tian, Erfan Loweimi, Chi Ho Chan, Mengjie Qian, Awan Muhammad, Ivor Spence, Rob Cooper, Wing W. Y. Ng, Josef Kittler, Mark Gales, Hui Wang
As video content continues to proliferate and many video archives lack suitable metadata, video retrieval, particularly through example-based search, has become increasingly crucial. Existing metadata often fails to meet the needs of specific types of searches, especially when videos contain elements from different modalities, such as visual and audio. Consequently, developing video retrieval methods that can handle multi-modal content is essential. An innovative Multi-modal Video Search by Examples (MVSE) framework is introduced, employing state-of-the-art techniques in its various components. In designing MVSE, the authors focused on accuracy, efficiency, interactivity, and extensibility, with key components including advanced data processing and a user-friendly interface aimed at enhancing search effectiveness and user experience. Furthermore, the framework was comprehensively evaluated, assessing individual components, data quality issues, and overall retrieval performance using high-quality and low-quality BBC archive videos. The evaluation reveals that: (1) multi-modal search yields better results than single-modal search; (2) the quality of video, both visual and audio, affects query precision, and audio quality has a greater impact on query precision than image quality; (3) a two-stage search process (i.e. searching by Hamming distance based on hashing, followed by searching by cosine similarity based on embedding) is effective but increases time overhead; (4) large-scale video retrieval is not only feasible but can be expected in the near future.
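The two-stage retrieval in finding (3) can be pictured with a minimal sketch: binary hash codes are compared by Hamming distance to shortlist candidates, which are then re-ranked by cosine similarity over their float embeddings. This is a generic illustration, not the MVSE pipeline; the code length, index size, and shortlist size are placeholders.

```python
import numpy as np

def two_stage_search(query_hash, query_emb, db_hashes, db_embs, shortlist=100, top_k=10):
    """Stage 1: Hamming distance on binary hash codes to shortlist candidates.
       Stage 2: cosine similarity on embeddings to re-rank the shortlist."""
    # Stage 1: Hamming distance = number of differing bits.
    hamming = np.count_nonzero(db_hashes != query_hash, axis=1)
    candidates = np.argsort(hamming)[:shortlist]

    # Stage 2: cosine similarity between the query embedding and each candidate.
    cand_embs = db_embs[candidates]
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    order = np.argsort(-sims)[:top_k]
    return candidates[order], sims[order]

# Toy index: 10,000 items with 64-bit hashes and 512-d embeddings (illustrative sizes).
rng = np.random.default_rng(0)
db_hashes = rng.integers(0, 2, size=(10_000, 64), dtype=np.uint8)
db_embs = rng.standard_normal((10_000, 512)).astype(np.float32)
ids, scores = two_stage_search(db_hashes[42], db_embs[42], db_hashes, db_embs)
print(ids[:3], scores[:3])
```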
{"title":"Multi-modal video search by examples—A video quality impact analysis","authors":"Guanfeng Wu, Abbas Haider, Xing Tian, Erfan Loweimi, Chi Ho Chan, Mengjie Qian, Awan Muhammad, Ivor Spence, Rob Cooper, Wing W. Y. Ng, Josef Kittler, Mark Gales, Hui Wang","doi":"10.1049/cvi2.12303","DOIUrl":"10.1049/cvi2.12303","url":null,"abstract":"<p>As the proliferation of video content continues, and many video archives lack suitable metadata, therefore, video retrieval, particularly through example-based search, has become increasingly crucial. Existing metadata often fails to meet the needs of specific types of searches, especially when videos contain elements from different modalities, such as visual and audio. Consequently, developing video retrieval methods that can handle multi-modal content is essential. An innovative Multi-modal Video Search by Examples (MVSE) framework is introduced, employing state-of-the-art techniques in its various components. In designing MVSE, the authors focused on accuracy, efficiency, interactivity, and extensibility, with key components including advanced data processing and a user-friendly interface aimed at enhancing search effectiveness and user experience. Furthermore, the framework was comprehensively evaluated, assessing individual components, data quality issues, and overall retrieval performance using high-quality and low-quality BBC archive videos. The evaluation reveals that: (1) multi-modal search yields better results than single-modal search; (2) the quality of video, both visual and audio, has an impact on the query precision. Compared with image query results, audio quality has a greater impact on the query precision (3) a two-stage search process (i.e. searching by Hamming distance based on hashing, followed by searching by Cosine similarity based on embedding); is effective but increases time overhead; (4) large-scale video retrieval is not only feasible but also expected to emerge shortly.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"1017-1033"},"PeriodicalIF":1.5,"publicationDate":"2024-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12303","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141798043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lei Wang, Jianwei Zhang, Wenbing Yang, Song Gu, Shanmin Yang
Human actions are predominantly presented in 2D format in video surveillance scenarios, which hinders the accurate determination of action details not apparent in 2D data. Depth estimation can aid human action recognition tasks, enhancing accuracy with neural networks. However, relying on images for depth estimation requires extensive computational resources and cannot utilise the connectivity between human body structures. Besides, the estimated depth may not accurately reflect actual depth ranges, necessitating improved reliability. Therefore, a 2D human skeleton action recognition method with spatial constraints (2D-SCHAR) is introduced. 2D-SCHAR employs graph convolutional networks to process graph-structured human skeleton data and comprises three parts: depth estimation, spatial transformation, and action recognition. The first two components, which infer 3D information from 2D skeleton actions and generate spatial transformation parameters to correct abnormal deviations in action data, support the action recognition component and enhance its accuracy. The model is designed in an end-to-end, multi-task manner, allowing parameter sharing among the three components to boost performance. The experimental results validate the model's effectiveness and superiority in human skeleton action recognition.
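As a rough illustration of the kind of architecture described above, the sketch below builds a small graph convolutional backbone over a fixed skeleton adjacency and attaches two heads that share it: per-joint depth regression and action classification. The joint count, edges, and layer sizes are placeholders and this is not the 2D-SCHAR design.

```python
import torch
import torch.nn as nn

class SkeletonGCNBlock(nn.Module):
    """One graph convolution over a fixed skeleton adjacency: X' = ReLU(A_hat X W)."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        a = adj + torch.eye(adj.size(0))            # add self-loops
        d = a.sum(dim=1).rsqrt().diag()             # D^{-1/2}
        self.register_buffer("a_hat", d @ a @ d)    # symmetric normalisation
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                           # x: (batch, joints, in_dim)
        return torch.relu(self.lin(self.a_hat @ x))

class MultiTaskSkeletonNet(nn.Module):
    """Shared GCN backbone with two heads (per-joint depth, action class),
    echoing the parameter-sharing idea described in the abstract."""
    def __init__(self, adj, num_actions, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(SkeletonGCNBlock(2, hidden, adj),
                                      SkeletonGCNBlock(hidden, hidden, adj))
        self.depth_head = nn.Linear(hidden, 1)           # per-joint depth estimate
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, joints_2d):                        # (batch, joints, 2)
        h = self.backbone(joints_2d)
        return self.depth_head(h).squeeze(-1), self.action_head(h.mean(dim=1))

adj = torch.zeros(17, 17)                                # e.g. a 17-joint skeleton (illustrative)
for i, j in [(0, 1), (1, 2), (2, 3)]:                    # a few edges only, for demonstration
    adj[i, j] = adj[j, i] = 1.0
model = MultiTaskSkeletonNet(adj, num_actions=10)
depth, logits = model(torch.randn(4, 17, 2))
```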
{"title":"2D human skeleton action recognition with spatial constraints","authors":"Lei Wang, Jianwei Zhang, Wenbing Yang, Song Gu, Shanmin Yang","doi":"10.1049/cvi2.12296","DOIUrl":"10.1049/cvi2.12296","url":null,"abstract":"<p>Human actions are predominantly presented in 2D format in video surveillance scenarios, which hinders the accurate determination of action details not apparent in 2D data. Depth estimation can aid human action recognition tasks, enhancing accuracy with neural networks. However, reliance on images for depth estimation requires extensive computational resources and cannot utilise the connectivity between human body structures. Besides, the depth information may not accurately reflect actual depth ranges, necessitating improved reliability. Therefore, a 2D human skeleton action recognition method with spatial constraints (2D-SCHAR) is introduced. 2D-SCHAR employs graph convolution networks to process graph-structured human action skeleton data comprising three parts: depth estimation, spatial transformation, and action recognition. The initial two components, which infer 3D information from 2D human skeleton actions and generate spatial transformation parameters to correct abnormal deviations in action data, support the latter in the model to enhance the accuracy of action recognition. The model is designed in an end-to-end, multitasking manner, allowing parameter sharing among these three components to boost performance. The experimental results validate the model's effectiveness and superiority in human skeleton action recognition.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"968-981"},"PeriodicalIF":1.5,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12296","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141657484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernardas Ciapas, Povilas Treigys
Siamese networks excel at comparing two images, serving as an effective class verification technique when there is a single reference image per class. However, when multiple reference images are present, Siamese verification necessitates multiple comparisons and aggregation, which is often impractical at inference. The Centre-Loss approach proposed in this research solves the class verification task more efficiently than sample-to-sample approaches, using a single forward pass during inference. Optimising a Centre-Loss function learns class centres and minimises intra-class distances in latent space. The authors compared verification accuracy using Centre-Loss against aggregated Siamese verification when other hyperparameters (such as the neural network backbone and distance type) are the same. Experiments were performed to contrast the ubiquitous Euclidean distance against other distance types and to discover the optimum Centre-Loss layer, its size, and the Centre-Loss weight. In the optimal architecture, the Centre-Loss layer is connected to the penultimate layer, calculates Euclidean distance, and its size depends on the distance type. The Centre-Loss method was validated on the Self-Checkout products and Fruits 360 image datasets. Its comparable accuracy and lower complexity make it preferable to sample-to-sample approaches for the class verification task when the number of reference images per class is high and inference speed matters, such as in self-checkouts.
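A minimal sketch of the Centre-Loss idea as stated in the abstract: class centres are learnable parameters, training pulls each embedding towards its own centre, and verification accepts a claimed class when the embedding lies within a distance threshold of that centre, using a single forward pass. The embedding dimension, class count, and threshold below are illustrative, and in practice the loss would be combined with a classification loss on backbone features.

```python
import torch
import torch.nn as nn

class CentreLoss(nn.Module):
    """Learnable class centres; the loss pulls each embedding towards its class centre."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centres = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # Squared Euclidean distance between each embedding and its own class centre.
        return ((feats - self.centres[labels]) ** 2).sum(dim=1).mean()

    @torch.no_grad()
    def verify(self, feats, claimed_labels, threshold):
        """Single-forward-pass verification: accept if the embedding lies within
        `threshold` of the claimed class centre (threshold tuned on validation data)."""
        dists = (feats - self.centres[claimed_labels]).norm(dim=1)
        return dists < threshold

# Toy usage: 128-d embeddings from some backbone, 50 product classes (illustrative numbers).
centre_loss = CentreLoss(num_classes=50, feat_dim=128)
feats = torch.randn(32, 128)
labels = torch.randint(0, 50, (32,))
loss = centre_loss(feats, labels)                    # added to the main training objective
accepted = centre_loss.verify(feats, labels, threshold=5.0)
```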
{"title":"Centre-loss—A preferred class verification approach over sample-to-sample in self-checkout products datasets","authors":"Bernardas Ciapas, Povilas Treigys","doi":"10.1049/cvi2.12302","DOIUrl":"10.1049/cvi2.12302","url":null,"abstract":"<p>Siamese networks excel at comparing two images, serving as an effective class verification technique for a single-per-class reference image. However, when multiple reference images are present, Siamese verification necessitates multiple comparisons and aggregation, often unpractical at inference. The Centre-Loss approach, proposed in this research, solves a class verification task more efficiently, using a single forward-pass during inference, than sample-to-sample approaches. Optimising a Centre-Loss function learns class centres and minimises intra-class distances in latent space. The authors compared verification accuracy using Centre-Loss against aggregated Siamese when other hyperparameters (such as neural network backbone and distance type) are the same. Experiments were performed to contrast the ubiquitous Euclidean against other distance types to discover the optimum Centre-Loss layer, its size, and Centre-Loss weight. In optimal architecture, the Centre-Loss layer is connected to the penultimate layer, calculates Euclidean distance, and its size depends on distance type. The Centre-Loss method was validated on the Self-Checkout products and Fruits 360 image datasets. Centre-Loss comparable accuracy and lesser complexity make it a preferred approach over sample-to-sample for the class verification task, when the number of reference image per class is high and inference speed is a factor, such as in self-checkouts.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"1004-1016"},"PeriodicalIF":1.5,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12302","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141657814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhuoyan Xu, Jingke Xu
In in-vehicle driving scenarios, composite action recognition is crucial for improving safety and understanding the driver's intention. Due to spatial constraints and occlusion, the driver's range of motion is limited, resulting in similar action patterns that are difficult to differentiate. Additionally, collecting skeleton data that characterise the full human posture is difficult, posing further challenges for action recognition. To address these problems, a novel Graph-Reinforcement Transformer (GR-Former) model is proposed. Using limited skeleton data as input, the model introduces graph structure information to directionally reinforce the self-attention mechanism and dynamically learns and aggregates features between joints at multiple levels, constructing a richer feature vector space and enhancing expressiveness and recognition accuracy. On the Drive & Act dataset for composite action recognition, the authors' method applies only upper-body skeleton data yet achieves state-of-the-art performance compared with existing methods. With complete human skeleton data, it also achieves excellent recognition accuracy on the NTU RGB+D and NTU RGB+D 120 datasets, demonstrating the strong generalisability of GR-Former. Overall, the authors' work provides a new and effective solution for driver action recognition in in-vehicle scenarios.
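One plausible reading of graph-reinforced self-attention, sketched below, is to add a learnable bias to the attention logits of joint pairs that are connected in the skeleton graph, so attention is directionally strengthened along body structure. This is an illustrative interpretation, not the GR-Former formulation; the joint count and feature size are placeholders.

```python
import torch
import torch.nn as nn

class GraphBiasedAttention(nn.Module):
    """Single-head self-attention over skeleton joints whose logits receive an extra
    learnable bonus for graph-connected joint pairs (illustrative interpretation only)."""
    def __init__(self, dim, adj):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.register_buffer("adj", adj)                  # (joints, joints), 0/1 adjacency
        self.bias_scale = nn.Parameter(torch.tensor(1.0)) # learnable strength of the graph bias
        self.scale = dim ** -0.5

    def forward(self, x):                                 # x: (batch, joints, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale     # (batch, joints, joints)
        logits = logits + self.bias_scale * self.adj      # reinforce connected joint pairs
        attn = logits.softmax(dim=-1)
        return attn @ v

adj = torch.zeros(12, 12)                                 # e.g. 12 upper-body joints (illustrative)
layer = GraphBiasedAttention(dim=64, adj=adj)
out = layer(torch.randn(8, 12, 64))                       # (8, 12, 64)
```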
{"title":"GR-Former: Graph-reinforcement transformer for skeleton-based driver action recognition","authors":"Zhuoyan Xu, Jingke Xu","doi":"10.1049/cvi2.12298","DOIUrl":"10.1049/cvi2.12298","url":null,"abstract":"<p>In in-vehicle driving scenarios, composite action recognition is crucial for improving safety and understanding the driver's intention. Due to spatial constraints and occlusion factors, the driver's range of motion is limited, thus resulting in similar action patterns that are difficult to differentiate. Additionally, collecting skeleton data that characterise the full human posture is difficult, posing additional challenges for action recognition. To address the problems, a novel Graph-Reinforcement Transformer (GR-Former) model is proposed. Using limited skeleton data as inputs, by introducing graph structure information to directionally reinforce the effect of the self-attention mechanism, dynamically learning and aggregating features between joints at multiple levels, the authors’ model constructs a richer feature vector space, enhancing its expressiveness and recognition accuracy. Based on the Drive & Act dataset for composite action recognition, the authors’ work only applies human upper-body skeleton data to achieve state-of-the-art performance compared to existing methods. Using complete human skeleton data also has excellent recognition accuracy on the NTU RGB + D- and NTU RGB + D 120 dataset, demonstrating the great generalisability of the GR-Former. Generally, the authors’ work provides a new and effective solution for driver action recognition in in-vehicle scenarios.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"982-991"},"PeriodicalIF":1.5,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12298","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141659905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fan Zhang, Ding Chongyang, Kai Liu, Liu Hongjin
Human action recognition based on graph convolutional networks (GCNs) is one of the hotspots in computer vision. However, previous methods generally rely on handcrafted graphs, which limits the model's ability to characterise the connections between indirectly connected joints; these connections are weakened when joints are separated by long distances. To address this issue, the authors propose a skeleton simplification method that reduces the number of joints and the distance between joints by merging adjacent joints into simplified joints. A group convolutional block is devised to extract the internal features of the simplified joints. Additionally, the authors enhance the method with multi-scale modelling, which maps inputs into sequences at various levels of simplification. Combined with spatial-temporal graph convolution, a multi-scale skeleton simplification GCN for skeleton-based action recognition (M3S-GCN) is proposed for fusing multi-scale skeleton sequences and modelling the connections between joints. Finally, M3S-GCN is evaluated on five benchmarks: NTU RGB+D 60 (C-Sub, C-View), NTU RGB+D 120 (X-Sub, X-Set), and NW-UCLA. Experimental results show that M3S-GCN achieves state-of-the-art performance with accuracies of 93.0%, 97.0%, and 91.2% on the C-Sub, C-View, and X-Set benchmarks, which validates the effectiveness of the method.
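The joint-merging step can be pictured as pooling groups of adjacent joints into simplified joints, as in the sketch below. The grouping of the 25 NTU RGB+D joints into five body parts is illustrative only and is not the partition used by M3S-GCN.

```python
import numpy as np

def simplify_skeleton(joints, groups):
    """Merge adjacent joints into simplified joints by averaging each group.
    `joints`: (frames, num_joints, channels); `groups`: list of joint-index lists."""
    return np.stack([joints[:, g, :].mean(axis=1) for g in groups], axis=1)

# A 25-joint skeleton reduced to a coarser 5-part skeleton (illustrative grouping).
groups_scale1 = [
    [0, 1, 2, 3, 20],          # spine + neck + head
    [4, 5, 6, 7, 21, 22],      # left arm + hand
    [8, 9, 10, 11, 23, 24],    # right arm + hand
    [12, 13, 14, 15],          # left leg
    [16, 17, 18, 19],          # right leg
]
seq = np.random.randn(64, 25, 3)              # 64 frames, 25 joints, (x, y, z)
coarse = simplify_skeleton(seq, groups_scale1)
print(coarse.shape)                           # (64, 5, 3)
```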
{"title":"Multi-scale skeleton simplification graph convolutional network for skeleton-based action recognition","authors":"Fan Zhang, Ding Chongyang, Kai Liu, Liu Hongjin","doi":"10.1049/cvi2.12300","DOIUrl":"10.1049/cvi2.12300","url":null,"abstract":"<p>Human action recognition based on graph convolutional networks (GCNs) is one of the hotspots in computer vision. However, previous methods generally rely on handcrafted graph, which limits the effectiveness of the model in characterising the connections between indirectly connected joints. The limitation leads to weakened connections when joints are separated by long distances. To address the above issue, the authors propose a skeleton simplification method which aims to reduce the number of joints and the distance between joints by merging adjacent joints into simplified joints. Group convolutional block is devised to extract the internal features of the simplified joints. Additionally, the authors enhance the method by introducing multi-scale modelling, which maps inputs into sequences across various levels of simplification. Combining with spatial temporal graph convolution, a multi-scale skeleton simplification GCN for skeleton-based action recognition (M3S-GCN) is proposed for fusing multi-scale skeleton sequences and modelling the connections between joints. Finally, M3S-GCN is evaluated on five benchmarks of NTU RGB+D 60 (C-Sub, C-View), NTU RGB+D 120 (X-Sub, X-Set) and NW-UCLA datasets. Experimental results show that the authors’ M3S-GCN achieves state-of-the-art performance with the accuracies of 93.0%, 97.0% and 91.2% on C-Sub, C-View and X-Set benchmarks, which validates the effectiveness of the method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"992-1003"},"PeriodicalIF":1.5,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12300","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141668289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Schneider, Kim Lindner, Markus Vogelbacher, Hicham Bellafkir, Nina Farwig, Bernd Freisleben
Most machine learning methods for animal recognition in camera trap images are limited to mammal identification and group birds into a single class. Machine learning methods for visually discriminating birds, in turn, cannot discriminate between mammals and are not designed for camera trap images. The authors present deep neural network models to recognise both mammals and bird species in camera trap images. They train neural network models for species classification as well as for predicting the animal taxonomy, that is, genus, family, order, group, and class names. Different neural network architectures, including ResNet, EfficientNetV2, Vision Transformer, Swin Transformer, and ConvNeXt, are compared for these tasks. Furthermore, the authors investigate approaches to overcome various challenges associated with camera trap image analysis. The authors’ best species classification models achieve a mean average precision (mAP) of 97.91% on a validation data set and mAPs of 90.39% and 82.77% on test data sets recorded in forests in Germany and Poland, respectively. Their best taxonomic classification models reach a validation mAP of 97.18% and mAPs of 94.23% and 79.92% on the two test data sets, respectively.
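A sketch of one possible setup for the classification tasks described above: a shared image backbone with one linear head per taxonomic level. The paper compares several backbones and may train separate models per task; the backbone choice and class counts below are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

class TaxonomyClassifier(nn.Module):
    """Shared image backbone with one classification head per taxonomic level.
    Illustrative setup only; class counts are placeholders."""
    def __init__(self, num_classes_per_level):
        super().__init__()
        backbone = models.resnet50(weights=None)         # any torchvision backbone would do
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                      # expose pooled features
        self.backbone = backbone
        self.heads = nn.ModuleDict({
            level: nn.Linear(feat_dim, n) for level, n in num_classes_per_level.items()
        })

    def forward(self, images):
        feats = self.backbone(images)
        return {level: head(feats) for level, head in self.heads.items()}

model = TaxonomyClassifier({"species": 60, "genus": 45, "family": 30, "order": 12})
logits = model(torch.randn(2, 3, 224, 224))
print({k: v.shape for k, v in logits.items()})
```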
{"title":"Recognition of European mammals and birds in camera trap images using deep neural networks","authors":"Daniel Schneider, Kim Lindner, Markus Vogelbacher, Hicham Bellafkir, Nina Farwig, Bernd Freisleben","doi":"10.1049/cvi2.12294","DOIUrl":"10.1049/cvi2.12294","url":null,"abstract":"<p>Most machine learning methods for animal recognition in camera trap images are limited to mammal identification and group birds into a single class. Machine learning methods for visually discriminating birds, in turn, cannot discriminate between mammals and are not designed for camera trap images. The authors present deep neural network models to recognise both mammals and bird species in camera trap images. They train neural network models for species classification as well as for predicting the animal taxonomy, that is, genus, family, order, group, and class names. Different neural network architectures, including ResNet, EfficientNetV2, Vision Transformer, Swin Transformer, and ConvNeXt, are compared for these tasks. Furthermore, the authors investigate approaches to overcome various challenges associated with camera trap image analysis. The authors’ best species classification models achieve a mean average precision (mAP) of 97.91% on a validation data set and mAPs of 90.39% and 82.77% on test data sets recorded in forests in Germany and Poland, respectively. Their best taxonomic classification models reach a validation mAP of 97.18% and mAPs of 94.23% and 79.92% on the two test data sets, respectively.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1162-1192"},"PeriodicalIF":1.5,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12294","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141683177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiatai Wang, Zhiwei Xu, Xuewen Yang, Hailong Li, Bo Li, Xuying Meng
In recent years, multi-view clustering (MVC) has had significant implications in the fields of cross-modal representation learning and data-driven decision-making. Its main objective is to cluster samples into distinct groups by leveraging consistency and complementary information among multiple views. Meanwhile, the field of computer vision has witnessed the evolution of contrastive learning, and self-supervised learning has made substantial research progress; consequently, self-supervised learning is progressively becoming dominant in MVC methods. It involves designing proxy tasks to extract supervisory information from image and video data, thereby guiding the clustering process. Despite the rapid development of self-supervised MVC, there is currently no comprehensive survey analysing and summarising the current state of research progress. Hence, the authors explore the emergence of self-supervised MVC by discussing the reasons and advantages behind it. Additionally, the internal connections and classifications of common datasets, data issues, representation learning methods, and self-supervised learning methods are investigated. The authors not only introduce the mechanisms for each category of methods but also provide illustrative examples of their applications. Finally, some open problems are identified for further investigation and development.
{"title":"Self-supervised multi-view clustering in computer vision: A survey","authors":"Jiatai Wang, Zhiwei Xu, Xuewen Yang, Hailong Li, Bo Li, Xuying Meng","doi":"10.1049/cvi2.12299","DOIUrl":"https://doi.org/10.1049/cvi2.12299","url":null,"abstract":"<p>In recent years, multi-view clustering (MVC) has had significant implications in the fields of cross-modal representation learning and data-driven decision-making. Its main objective is to cluster samples into distinct groups by leveraging consistency and complementary information among multiple views. However, the field of computer vision has witnessed the evolution of contrastive learning, and self-supervised learning has made substantial research progress. Consequently, self-supervised learning is progressively becoming dominant in MVC methods. It involves designing proxy tasks to extract supervisory information from image and video data, thereby guiding the clustering process. Despite the rapid development of self-supervised MVC, there is currently no comprehensive survey analysing and summarising the current state of research progress. Hence, the authors aim to explore the emergence of self-supervised MVC by discussing the reasons and advantages behind it. Additionally, the internal connections and classifications of common datasets, data issues, representation learning methods, and self-supervised learning methods are investigated. The authors not only introduce the mechanisms for each category of methods, but also provide illustrative examples of their applications. Finally, some open problems are identified for further investigation and development.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"709-734"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12299","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bohua Zhang, Jianru Xue
In recent years, detecting anomalies in real-world surveillance videos from weakly supervised data has emerged as a challenge. Traditional methods, utilising multi-instance learning (MIL) with video snippets, struggle with background noise and tend to overlook subtle anomalies. To tackle this, the authors propose a novel approach that crops snippets to create multiple instances with less noise, evaluates them separately, and then fuses these evaluations for more precise anomaly detection. This method, however, leads to higher computational demands, especially during inference. To address this, the authors' solution employs mutual learning to guide snippet feature training using these low-noise crops. MIL is used for the primary task, with snippets as inputs, and multiple-multiple instance learning (MMIL) for an auxiliary task with crops during training. The authors' approach ensures consistent multi-instance results in both tasks and incorporates a temporal activation mutual learning module (TAML) for aligning temporal anomaly activations between snippets and crops, improving the overall quality of snippet representations. Additionally, a snippet feature discrimination enhancement module (SFDE) refines the snippet features further. Tested across various datasets, the authors' method shows remarkable performance, notably achieving a frame-level AUC of 85.78% on the UCF-Crime dataset, while reducing computational costs.
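For context, the sketch below shows the standard weakly supervised MIL ranking objective that such snippet-based methods start from: the top-k snippet scores of an anomalous video should exceed those of a normal video by a margin, with sparsity and temporal-smoothness regularisers. It does not implement the authors' crop fusion, TAML, or SFDE modules, and the top-k, margin, and regulariser weights are illustrative.

```python
import torch
import torch.nn.functional as F

def mil_topk_loss(scores_abnormal, scores_normal, k=3, margin=1.0):
    """Weakly supervised MIL ranking objective over per-snippet anomaly scores in [0, 1].
    `scores_*`: (batch, num_snippets)."""
    top_abnormal = scores_abnormal.topk(k, dim=1).values.mean(dim=1)
    top_normal = scores_normal.topk(k, dim=1).values.mean(dim=1)
    ranking = F.relu(margin - top_abnormal + top_normal).mean()
    # Common regularisers: anomalies are sparse and scores vary smoothly over time.
    sparsity = scores_abnormal.mean()
    smoothness = (scores_abnormal[:, 1:] - scores_abnormal[:, :-1]).pow(2).mean()
    return ranking + 8e-3 * sparsity + 8e-3 * smoothness    # weights are illustrative

abn = torch.rand(4, 32, requires_grad=True)   # 4 anomalous videos, 32 snippets each
nor = torch.rand(4, 32, requires_grad=True)
loss = mil_topk_loss(torch.sigmoid(abn), torch.sigmoid(nor))
loss.backward()
```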
{"title":"Fusing crops representation into snippet via mutual learning for weakly supervised surveillance anomaly detection","authors":"Bohua Zhang, Jianru Xue","doi":"10.1049/cvi2.12289","DOIUrl":"10.1049/cvi2.12289","url":null,"abstract":"<p>In recent years, the challenge of detecting anomalies in real-world surveillance videos using weakly supervised data has emerged. Traditional methods, utilising multi-instance learning (MIL) with video snippets, struggle with background noise and tend to overlook subtle anomalies. To tackle this, the authors propose a novel approach that crops snippets to create multiple instances with less noise, separately evaluates them and then fuses these evaluations for more precise anomaly detection. This method, however, leads to higher computational demands, especially during inference. Addressing this, our solution employs mutual learning to guide snippet feature training using these low-noise crops. The authors integrate multiple instance learning (MIL) for the primary task with snippets as inputs and multiple-multiple instance learning (MMIL) for an auxiliary task with crops during training. The authors’ approach ensures consistent multi-instance results in both tasks and incorporates a temporal activation mutual learning module (TAML) for aligning temporal anomaly activations between snippets and crops, improving the overall quality of snippet representations. Additionally, a snippet feature discrimination enhancement module (SFDE) refines the snippet features further. Tested across various datasets, the authors’ method shows remarkable performance, notably achieving a frame-level AUC of 85.78% on the UCF-Crime dataset, while reducing computational costs.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1112-1126"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12289","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141684297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiaqi Ren, Junping Qin, Qianli Ma, Yin Cao
Although many new methods have emerged for text-driven image manipulation, the large computational power required for model training makes their training slow. Additionally, these methods consume a considerable amount of video random access memory (VRAM) during training; when generating high-resolution images, VRAM is often insufficient, making such images impossible to generate. Recent advancements in Vision Transformers (ViTs), however, have demonstrated strong image classification and recognition capabilities. Unlike traditional Convolutional Neural Network-based methods, ViTs have a Transformer-based architecture and leverage attention mechanisms to capture comprehensive global information; their inherent long-range dependencies enable an enhanced global understanding of images, extracting more robust features and achieving comparable results with a reduced computational load. The adaptability of ViTs to text-driven image manipulation was investigated. Specifically, existing image generation methods were refined and the FastFaceCLIP method was proposed by combining the image-text semantic alignment function of the pre-trained CLIP model with the high-resolution image generation function of the proposed FastFace. Additionally, the Multi-Axis Nested Transformer module was incorporated for advanced feature extraction from the latent space, generating higher-resolution images that are further enhanced using the Real-ESRGAN algorithm. Finally, extensive face-manipulation tests on the CelebA-HQ dataset compare the proposed method with related schemes, demonstrating that FastFaceCLIP effectively generates semantically accurate, visually realistic, and clear images using fewer parameters and less time.
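The image-text semantic alignment that drives this kind of editing can be sketched as optimising a latent code so that the embedding of the generated image matches the embedding of the target prompt. In the toy example below, the generator and image encoder are hypothetical stand-in modules, not the pre-trained CLIP or FastFace models the paper combines.

```python
import torch
import torch.nn.functional as F

def clip_guidance_loss(image_emb, text_emb):
    """Text-driven editing objective: maximise cosine similarity between the embedding
    of the generated image and the embedding of the target text prompt."""
    return 1.0 - F.cosine_similarity(image_emb, text_emb, dim=-1).mean()

# Hypothetical stand-ins for the pre-trained pieces (latent-to-image generator and
# image encoder); these are not real library calls.
generator = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Tanh())
image_encoder = torch.nn.Linear(512, 512)
text_emb = torch.randn(1, 512)                       # would come from a text encoder

latent = torch.randn(1, 512, requires_grad=True)     # latent code being edited
optimiser = torch.optim.Adam([latent], lr=0.05)
for _ in range(10):                                  # a few latent-optimisation steps
    optimiser.zero_grad()
    loss = clip_guidance_loss(image_encoder(generator(latent)), text_emb)
    loss.backward()
    optimiser.step()
```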
{"title":"FastFaceCLIP: A lightweight text-driven high-quality face image manipulation","authors":"Jiaqi Ren, Junping Qin, Qianli Ma, Yin Cao","doi":"10.1049/cvi2.12295","DOIUrl":"10.1049/cvi2.12295","url":null,"abstract":"<p>Although many new methods have emerged in text-driven images, the large computational power required for model training causes these methods to have a slow training process. Additionally, these methods consume a considerable amount of video random access memory (VRAM) resources during training. When generating high-resolution images, the VRAM resources are often insufficient, which results in the inability to generate high-resolution images. Nevertheless, recent Vision Transformers (ViTs) advancements have demonstrated their image classification and recognition capabilities. Unlike the traditional Convolutional Neural Networks based methods, ViTs have a Transformer-based architecture, leverage attention mechanisms to capture comprehensive global information, moreover enabling enhanced global understanding of images through inherent long-range dependencies, thus extracting more robust features and achieving comparable results with reduced computational load. The adaptability of ViTs to text-driven image manipulation was investigated. Specifically, existing image generation methods were refined and the FastFaceCLIP method was proposed by combining the image-text semantic alignment function of the pre-trained CLIP model with the high-resolution image generation function of the proposed FastFace. Additionally, the Multi-Axis Nested Transformer module was incorporated for advanced feature extraction from the latent space, generating higher-resolution images that are further enhanced using the Real-ESRGAN algorithm. Eventually, extensive face manipulation-related tests on the CelebA-HQ dataset challenge the proposed method and other related schemes, demonstrating that FastFaceCLIP effectively generates semantically accurate, visually realistic, and clear images using fewer parameters and less time.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"950-967"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12295","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141687557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yonghao Wan, Aimin Feng
Anomaly detection, also known as outlier detection, is critical in domains such as network security, intrusion detection, and fraud detection. One popular approach to anomaly detection is using autoencoders, which are trained to reconstruct the input by minimising the reconstruction error of a neural network. However, these methods usually suffer from a trade-off between normal reconstruction fidelity and abnormal reconstruction distinguishability, which degrades performance. The authors find that this trade-off can be better mitigated by imposing constraints on the latent space of images. To this end, the authors propose a new Dual Adversarial Network (DualAD) that consists of a Feature Constraint (FC) module and a reconstruction module. The method incorporates the FC module during reconstruction training to impose constraints on the latent space of images, thereby yielding feature representations more conducive to anomaly detection. Additionally, the authors employ dual adversarial learning to model the distribution of normal data. On the one hand, adversarial learning is applied during reconstruction to obtain higher-quality reconstructed samples, thereby preventing blurred reconstructions from degrading model performance. On the other hand, adversarial training of the FC module and the reconstruction module achieves superior feature representations, making anomalies more distinguishable at the feature level. During inference, anomaly detection is performed simultaneously in the pixel and latent spaces to identify abnormal patterns more comprehensively. Experiments on three data sets, CIFAR10, MNIST, and FashionMNIST, demonstrate the validity of the authors' work. Results show that constraints on the latent space and adversarial learning can improve detection performance.
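A simplified sketch of scoring anomalies in both spaces at inference: combine the pixel-space reconstruction error of an autoencoder with the deviation of the latent code from statistics of normal training data. This only illustrates the dual-space scoring idea, not the DualAD architecture with its FC module and adversarial training; the network sizes and weighting are placeholders.

```python
import torch
import torch.nn as nn

class SmallAutoencoder(nn.Module):
    """Toy convolutional autoencoder standing in for the reconstruction module."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

@torch.no_grad()
def anomaly_score(model, x, normal_latent_mean, alpha=0.5):
    """Dual-space score: pixel reconstruction error plus deviation of the latent code
    from the mean latent of normal training data (alpha weights the two terms)."""
    recon, z = model(x)
    pixel_err = (x - recon).pow(2).flatten(1).mean(dim=1)
    latent_err = (z.flatten(1) - normal_latent_mean).pow(2).mean(dim=1)
    return alpha * pixel_err + (1 - alpha) * latent_err

model = SmallAutoencoder().eval()
normal = torch.rand(16, 3, 32, 32)                        # stand-in for normal training images
normal_mean = model.enc(normal).flatten(1).mean(dim=0)    # reference latent statistics
scores = anomaly_score(model, torch.rand(4, 3, 32, 32), normal_mean)
```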
{"title":"DualAD: Dual adversarial network for image anomaly detection⋆","authors":"Yonghao Wan, Aimin Feng","doi":"10.1049/cvi2.12297","DOIUrl":"https://doi.org/10.1049/cvi2.12297","url":null,"abstract":"<p>Anomaly Detection, also known as outlier detection, is critical in domains such as network security, intrusion detection, and fraud detection. One popular approach to anomaly detection is using autoencoders, which are trained to reconstruct input by minimising reconstruction error with the neural network. However, these methods usually suffer from the trade-off between normal reconstruction fidelity and abnormal reconstruction distinguishability, which damages the performance. The authors find that the above trade-off can be better mitigated by imposing constraints on the latent space of images. To this end, the authors propose a new Dual Adversarial Network (DualAD) that consists of a Feature Constraint (FC) module and a reconstruction module. The method incorporates the FC module during the reconstruction training process to impose constraints on the latent space of images, thereby yielding feature representations more conducive to anomaly detection. Additionally, the authors employ dual adversarial learning to model the distribution of normal data. On the one hand, adversarial learning was implemented during the reconstruction process to obtain higher-quality reconstruction samples, thereby preventing the effects of blurred image reconstructions on model performance. On the other hand, the authors utilise adversarial training of the FC module and the reconstruction module to achieve superior feature representation, making anomalies more distinguishable at the feature level. During the inference phase, the authors perform anomaly detection simultaneously in the pixel and latent spaces to identify abnormal patterns more comprehensively. Experiments on three data sets CIFAR10, MNIST, and FashionMNIST demonstrate the validity of the authors’ work. Results show that constraints on the latent space and adversarial learning can improve detection performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1138-1148"},"PeriodicalIF":1.5,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12297","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}