
Pattern Recognition Letters: Latest Articles

Cross-modality white matter lesion segmentation by modality de-identification
IF 3.3 CAS Tier 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-10 DOI: 10.1016/j.patrec.2025.11.020
Domen Preložnik, Žiga Špiclin
Multiple sclerosis (MS) diagnosis and prognosis rely heavily on the accurate detection and segmentation of white matter lesions (WML) in magnetic resonance imaging (MRI). Different MRI sequences, particularly Fluid-Attenuated Inversion Recovery (FLAIR) and Double Inversion Recovery (DIR), offer complementary information about lesions but are rarely acquired simultaneously in clinical imaging protocols. We introduce a novel self-supervised modality sequential unlearning (SSMSU) adaptation technique that employs modality de-identification to extract modality-invariant features from MRI images, improving WML segmentation regardless of the input modality. Building upon the public nnU-Net framework, we introduce auxiliary modality classifiers at each resolution level and use a confusion loss to explicitly suppress modality-specific features while training on alternating modality inputs. We evaluated the approach on an in-house dataset of 28 MS patients with paired FLAIR and DIR, the MSSEG 2016 dataset of 53 subjects with paired FLAIR and proton density (DP), and 22 FLAIR test cases from MSLesSeg 2024. All cases had expert-annotated WML segmentations as reference. Experiments involved within- and between-dataset validation, comparing single- and multi-modality single-channel and multi-modality multi-channel training strategies based on the Dice Similarity Coefficient (DSC), Lesion-wise True Positive Rate (LTPR), and Lesion-wise False Discovery Rate (LFDR). On the in-house and MSSEG 2016 datasets, SSMSU achieved the best DSC and LTPR among single-channel models, with LFDR comparable to the best values, and it matched the performance of multi-channel models that required paired FLAIR/DIR or FLAIR/DP modalities. It ranked 2nd among single-channel methods on MSLesSeg 2024. Effectively suppressing modality-related information yields a cross-modal technique that delivers a flexible and robust automated WML segmentation tool.
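The confusion loss described above can be illustrated with a minimal sketch: an auxiliary classifier's predicted modality distribution is pushed toward a uniform distribution, so that shared features carry no modality-specific signal. This is a generic reconstruction of the idea, not the authors' implementation; all function names are ours.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confusion_loss(modality_logits):
    """Cross-entropy between the modality classifier's prediction and a
    uniform distribution over modalities: minimized when the classifier
    cannot tell which modality produced the features."""
    p = softmax(modality_logits)
    k = p.shape[-1]
    return float(-(np.log(p + 1e-12) / k).sum(axis=-1).mean())

# A perfectly confused classifier (uniform logits) scores lower than a
# confident one, so minimizing this loss suppresses modality information.
uniform = np.zeros((4, 2))                 # 4 samples, 2 modalities (e.g. FLAIR vs DIR)
confident = np.array([[10.0, -10.0]] * 4)  # classifier sure of the modality
assert confusion_loss(uniform) < confusion_loss(confident)
```

In the paper's setting this term would be added to the segmentation loss at each resolution level; here it stands alone for clarity.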
Citations: 0
FsBAD: Data-efficient feature reconstruction for few-shot brain anomaly detection
IF 3.3 CAS Tier 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-07 DOI: 10.1016/j.patrec.2025.11.016
Hussain Ahmad Madni , Hafsa Shujat , Axel De Nardin , Silvia Zottin , Gian Luca Foresti
Data efficiency remains a central challenge in brain anomaly detection, where annotated datasets are often scarce. Most existing methods are tailored to single-class settings and show limited ability to generalize. We introduce FsBAD, a feature reconstruction-based approach designed for few-shot brain anomaly detection with minimal supervision. FsBAD reconstructs a nominal version of an anomalous brain scan by leveraging a small set of aligned reference samples. To enhance reconstruction quality, we propose a novel feature alignment strategy that integrates regression with distribution regularization, promoting both semantic accuracy and nominal consistency. While FsBAD is optimized for brain imaging, we evaluate its generalization capabilities on liver and retina datasets. Experiments across all three domains show that FsBAD consistently outperforms state-of-the-art methods in both image-wise classification and pixel-wise anomaly localization, even in extremely low-shot (2- to 15-shot) settings. This demonstrates FsBAD’s potential as a scalable, data-efficient solution for brain anomaly detection and its robustness across medical imaging tasks.
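One way to picture reference-based anomaly scoring of the kind FsBAD builds on: compare query features against a small bank of nominal reference features and flag large deviations. This sketch uses a simple nearest-neighbor distance as a hypothetical stand-in for the paper's learned feature reconstruction.

```python
import numpy as np

def anomaly_score(query_feats, reference_feats):
    """For each query feature vector, the squared distance to its nearest
    reference feature; large distances flag candidate anomalies.
    Hypothetical stand-in for FsBAD's learned reconstruction."""
    # pairwise squared Euclidean distances, shape (n_query, n_reference)
    d = ((query_feats[:, None, :] - reference_feats[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1)

rng = np.random.default_rng(0)
refs = rng.normal(0.0, 1.0, size=(50, 8))     # small "nominal" reference bank
normal = rng.normal(0.0, 1.0, size=(5, 8))    # drawn from the same distribution
abnormal = rng.normal(6.0, 1.0, size=(5, 8))  # shifted distribution = anomalous
assert anomaly_score(abnormal, refs).mean() > anomaly_score(normal, refs).mean()
```

The few-shot aspect is reflected in the small reference bank: only a handful of aligned nominal samples are needed to score new inputs.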
Citations: 0
MixL-CNN: Lightweight multi-scale model for cross-domain aspect term extraction
IF 3.3 CAS Tier 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-07 DOI: 10.1016/j.patrec.2025.11.004
Linhai Liu, Weijiang Li
Cross-domain aspect term extraction (CD-ATE) is vital for fine-grained analysis, but deploying large pre-trained models like BERT is often infeasible in resource-constrained scenarios due to high computational costs. To address this challenge, we propose MixL-CNN, a novel lightweight convolutional neural network designed for efficient and effective CD-ATE. MixL-CNN integrates two core innovations: (1) Mixed Multi-Scale Convolution (MMSC) blocks that capture diverse contextual dependencies at different granularities, and (2) a dynamic, attention-based feature adaptation mechanism that enhances domain-aware feature extraction by selectively emphasizing relevant feature channels. Extensive experiments conducted on standard Restaurant, Laptop, and Device benchmarks across six domain shifts demonstrate that MixL-CNN achieves a new state-of-the-art average F1-score of 54.25 (a +0.32 improvement over the prior SOTA model WoChMutiE). Ablation studies confirm the crucial complementary roles of the multi-scale architecture and the dynamic adaptation component. Critically, MixL-CNN is exceptionally efficient, operating with only 0.87 MB of parameters (over 125x fewer than the BERT model) and delivering inference roughly 3x faster than the SOTA non-BERT model WoChMutiE. This balance of performance and efficiency positions MixL-CNN as a robust and practical solution for deploying high-performance CD-ATE in real-world, low-resource settings.
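The multi-scale idea behind the MMSC block can be sketched as running parallel convolutions with different kernel sizes over the same token sequence and stacking the resulting feature maps. This is an illustrative NumPy toy (random kernels, 1-D input), not the paper's architecture.

```python
import numpy as np

def conv1d_same(x, kernel):
    # 'same'-padded 1-D correlation: output length equals input length
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(kernel)] @ kernel for i in range(len(x))])

def mixed_multiscale(x, kernel_sizes=(3, 5, 7)):
    """Concatenate feature maps produced by convolutions with different
    receptive fields -- the intuition behind a mixed multi-scale block.
    Random kernels stand in for learned weights (illustrative only)."""
    rng = np.random.default_rng(42)
    outs = [conv1d_same(x, rng.normal(size=k)) for k in kernel_sizes]
    return np.stack(outs)            # shape: (num_scales, seq_len)

feats = mixed_multiscale(np.arange(10, dtype=float))
assert feats.shape == (3, 10)        # one feature map per kernel size
```

Each scale sees a different context width (3, 5, or 7 tokens here), which is what lets such a block capture dependencies "at different granularities".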
Citations: 0
TRIS: A multimodal and multitask framework for unifying text–image retrieval and referring image segmentation
IF 3.3 CAS Tier 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-12 DOI: 10.1016/j.patrec.2025.11.026
Zengzhi Qian , Yulong Sun , Weide Kang , Bingke Zhu , Jinqiao Wang
Existing text–image retrieval methods often underperform due to limited understanding of target objects in both text and images. To address this limitation, we propose TRIS, a multimodal and multitask framework that unifies text–image retrieval and referring image segmentation. TRIS accommodates four distinct text–image retrieval tasks and the referring image segmentation task. Through multitask coupled learning, retrieval and segmentation features interact and mutually facilitate multimodal feature learning, enhancing the performance of both tasks. Moreover, by exploiting the masks predicted by the segmentation task, we apply a reranking technique to further improve retrieval performance. Simultaneously, capitalizing on the consistency of images in the retrieval task, we propose a consistency loss to improve the target consistency of the segmentation task. Experimentally, we validate the efficacy of the TRIS framework across multiple text–image retrieval and referring image segmentation datasets.
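The mask-based reranking step can be sketched generically: blend each candidate's retrieval score with a score derived from its predicted segmentation mask and re-sort. The score names and blending weight below are hypothetical, not taken from the paper.

```python
def rerank(retrieval_scores, mask_scores, weight=0.5):
    """Blend each candidate's original retrieval score with a
    segmentation-derived score (e.g. mask confidence for the referred
    object) and return candidates sorted best-first."""
    blended = {c: (1 - weight) * s + weight * mask_scores[c]
               for c, s in retrieval_scores.items()}
    return sorted(blended, key=blended.get, reverse=True)

# A candidate that ranked second by text matching alone can overtake the
# leader once its segmentation mask strongly supports the referred object.
retrieval = {"img_a": 0.9, "img_b": 0.8}
masks = {"img_a": 0.1, "img_b": 0.9}
assert rerank(retrieval, masks)[0] == "img_b"
```

The design choice being illustrated: segmentation evidence acts as a second opinion on whether the retrieved image actually contains the described target.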
Citations: 0
Unsupervised contrastive analysis for anomaly detection in brain MRIs via conditional diffusion models
IF 3.3 CAS Tier 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-08 DOI: 10.1016/j.patrec.2025.11.014
Cristiano Patrício , Carlo Alberto Barbano , Attilio Fiandrotti , Riccardo Renzulli , Marco Grangetto , Luís F. Teixeira , João C. Neves
Contrastive Analysis (CA) detects anomalies by contrasting patterns unique to a target group (e.g., unhealthy subjects) with those in a background group (e.g., healthy subjects). In the context of brain MRIs, existing CA approaches rely on supervised contrastive learning or variational autoencoders (VAEs) using both healthy and unhealthy data, but such reliance on target samples is challenging in clinical settings. Unsupervised Anomaly Detection (UAD) learns a reference representation of healthy anatomy, eliminating the need for target samples. Deviations from this reference distribution can indicate potential anomalies. In this context, diffusion models have been increasingly adopted in UAD due to their superior image-generation performance compared to VAEs. Nonetheless, precisely reconstructing the anatomy of the brain remains a challenge. In this work, we bridge CA and UAD by reformulating contrastive analysis principles for the unsupervised setting. We propose an unsupervised framework that improves reconstruction quality by training a self-supervised contrastive encoder on healthy images to extract meaningful anatomical features. These features are used to condition a diffusion model to reconstruct the healthy appearance of a given image, enabling interpretable anomaly localization via pixel-wise comparison. We validate our approach through a proof-of-concept on a facial image dataset and further demonstrate its effectiveness on four brain MRI datasets, outperforming baseline methods in anomaly localization on the NOVA benchmark.
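The reconstruction-based localization step common to such UAD pipelines can be sketched as a pixel-wise residual between the input scan and its reconstructed healthy appearance, thresholded into a candidate anomaly mask. The diffusion model itself is abstracted away here; `healthy_reconstruction` stands in for its output.

```python
import numpy as np

def anomaly_map(image, healthy_reconstruction, threshold=0.5):
    """Pixel-wise absolute difference between an input scan and its
    'healthy' reconstruction; thresholding the residual yields a
    candidate lesion mask. Illustrative sketch, not the paper's code."""
    residual = np.abs(image - healthy_reconstruction)
    return residual, residual > threshold

img = np.zeros((4, 4))
img[1, 2] = 1.0                 # synthetic bright "lesion" pixel
recon = np.zeros((4, 4))        # reconstruction of the healthy appearance
res, mask = anomaly_map(img, recon)
assert mask.sum() == 1 and mask[1, 2]
```

The quality of such a map depends entirely on the reconstruction being anatomically faithful everywhere except at the anomaly, which is why the paper conditions the diffusion model on contrastively learned anatomical features.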
Citations: 0
Mitigating task randomness in graph few-shot learning
IF 3.3 CAS Tier 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-20 DOI: 10.1016/j.patrec.2025.11.022
Shuzhen Rao , Jun Huang
In graph few-shot learning, meta-training tasks are sampled to improve the model’s ability to learn from limited nodes. Existing methods, adapted from computer vision, generally employ random task sampling, which can lead to excessive task randomness. This hinders effective training on the graph, as models struggle to adapt to tasks with substantial variations in classes and nodes. To address this issue, we propose a novel method called TRARM, i.e., Task RAndomness Reduced graph Meta-learning, to mitigate the adverse effects of excessive task randomness. First, we design progressive grouping-based sampling to adjust combinations of classes and nodes in stages, enabling more focused and efficient meta-training. Second, complementing the sampling, we deploy a unified memory-based meta-update module to effectively accumulate cross-task knowledge, improving both the efficiency and stability of meta-learning. Despite its simplicity, comprehensive experiments demonstrate the superior performance of TRARM on four widely used benchmarks.
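A guess at the spirit of progressive grouping-based sampling (not TRARM's actual algorithm): early meta-training stages draw tasks from a small fixed group of classes, and later stages widen the pool, so early tasks vary less from one another.

```python
import random

def sample_task(classes, stage, n_way=2, k_shot=1, rng=None):
    """Sample an n-way k-shot task whose class pool grows with `stage`
    (a float in (0, 1]): early stages reuse a small class group, late
    stages sample from everything. Node IDs here are placeholders."""
    rng = rng or random.Random(0)
    # widen the candidate class pool as training progresses
    pool = classes[: max(n_way, int(len(classes) * stage))]
    chosen = rng.sample(pool, n_way)
    return {c: [f"{c}_node{i}" for i in range(k_shot)] for c in chosen}

task_early = sample_task(list("ABCDEFGH"), stage=0.25)  # pool limited to A, B
task_late = sample_task(list("ABCDEFGH"), stage=1.0)    # pool is all classes
assert len(task_early) == 2 and len(task_late) == 2
```

The point of staging is that consecutive early tasks share classes, so the meta-learner faces less task-to-task variation while its representations are still unstable.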
Citations: 0
Hybrid CNN and SVM model for Alzheimer’s disease classification using categorical focal loss function
IF 3.3 CAS Tier 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-24 DOI: 10.1016/j.patrec.2025.11.031
Wided Hechkel , Rim Missaoui , Abdelhamid Helali , Marco Leo
Alzheimer’s disease (AD) is the leading cause of dementia worldwide. It primarily affects the elderly, causing dangerous cognitive decline and memory loss due to the degeneration and atrophy of brain neurons. Recent developments in machine learning techniques for the detection and classification of AD boost early diagnosis and enable slowing the disease through preclinical treatments. However, a major drawback of these techniques is their highly complex architectures and limited generalizability, which complicate clinical integration. This paper presents a new approach that combines a convolutional neural network (CNN) and a support vector machine (SVM) for the detection of AD. The CNN stage enhances the accuracy of the system because it is an excellent feature extractor. The SVM stage handles classification by optimizing the decision boundaries; meanwhile, it requires fewer hyperparameter updates than an end-to-end CNN with a Softmax classifier, reducing the computational cost of training. Experiments are conducted on the Kaggle dataset of Magnetic Resonance Imaging (MRI) brain images for AD. The hybrid model achieved accuracy scores of 98.52 %, 97.71 %, and 97.58 % on the training, validation, and testing sets respectively, with inference times per sample of 0.0588 s, 0.0586 s, and 0.0592 s on the same three sets. The obtained results confirm the high effectiveness and potential of the developed CNN-SVM model for early diagnosis of AD with reduced implementation complexity.
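The categorical focal loss named in the title down-weights easy, well-classified samples so training concentrates on hard ones, which helps with class imbalance. Below is a minimal NumPy version using the standard gamma/alpha parameterization; the paper's exact hyperparameters are not given here.

```python
import numpy as np

def categorical_focal_loss(probs, onehot, gamma=2.0, alpha=0.25):
    """Categorical focal loss: -alpha * (1 - p_t)^gamma * log(p_t),
    averaged over samples, where p_t is the probability assigned to
    the true class. gamma > 0 shrinks the loss of easy examples."""
    pt = (probs * onehot).sum(axis=-1)
    return float((-alpha * (1.0 - pt) ** gamma * np.log(pt + 1e-12)).mean())

y = np.array([[1.0, 0.0, 0.0, 0.0]])              # e.g. one of 4 dementia stages
easy = np.array([[0.95, 0.02, 0.02, 0.01]])       # confidently correct
hard = np.array([[0.30, 0.30, 0.20, 0.20]])       # barely correct
assert categorical_focal_loss(hard, y) > categorical_focal_loss(easy, y)
```

With gamma = 0 and alpha = 1 the expression reduces to ordinary categorical cross-entropy, which is why focal loss is usually described as its re-weighted generalization.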
Citations: 0
Monocular 3D lane detection with geometry-guided transformation and contextual enhancement
IF 3.3 CAS Tier 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-26 DOI: 10.1016/j.patrec.2025.11.041
Chunying Song, Qiong Wang, Zeren Sun, Huafeng Liu
Monocular 3D lane detection is a critical yet challenging task in autonomous driving, largely due to the lack of depth cues, complex road geometries, and appearance variations in real-world environments. Existing approaches often depend on bird’s-eye-view transformations or rigid geometric assumptions, which may introduce projection artifacts and hinder generalization. In this paper, we present GeoCNet, a BEV-free framework that directly estimates 3D lanes in the perspective domain. The architecture incorporates three key components: a Geometry-Guided Spatial Transformer (GST) for adaptive multi-plane ground modeling, a Perception-Aware Feature Modulation (PFM) module for context-driven feature refinement, and a Structure-Aware Lane Decoder (SALD) that reconstructs lanes as curvature-regularized anchor-aligned sequences. Extensive experiments on the OpenLane dataset demonstrate that GeoCNet achieves competitive performance in overall accuracy and shows clear improvements in challenging conditions such as night scenes and complex intersections. Additional evaluation on the Apollo Synthetic dataset further confirms the robustness and cross-domain generalization of the proposed framework. These results underscore the effectiveness of jointly leveraging geometry and contextual cues for accurate and reliable monocular 3D lane detection. Our code has been released at https://github.com/chunyingsong/GeoCNet.
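The "curvature-regularized" sequence prior mentioned above can be illustrated with a second-difference penalty on a lane's lateral offsets: straight lanes incur no cost, wiggly ones do. This is one simple formulation of such a prior, not GeoCNet's actual regularizer.

```python
import numpy as np

def curvature_penalty(y):
    """Sum of squared second differences of a lane's lateral offsets,
    sampled at uniform longitudinal anchors -- a discrete proxy for
    integrated squared curvature (illustrative only)."""
    d2 = y[2:] - 2.0 * y[1:-1] + y[:-2]
    return float((d2 ** 2).sum())

straight = np.linspace(0.0, 1.0, 10)        # linear offsets: zero curvature
wiggly = np.sin(np.linspace(0.0, 6.0, 10))  # oscillating offsets
assert curvature_penalty(straight) < curvature_penalty(wiggly)
```

Adding such a term to a lane decoder's loss biases predictions toward the smooth, slowly curving geometry real lanes actually have, without forbidding curves outright.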
Anatomical foundation models for brain MRIs
IF 3.3, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-01. Epub Date: 2025-11-14. DOI: 10.1016/j.patrec.2025.11.028
Carlo Alberto Barbano, Matteo Brunello, Benoit Dufumier, Marco Grangetto, Alzheimer’s Disease Neuroimaging Initiative
Deep Learning (DL) in neuroimaging has become increasingly relevant for detecting neurological conditions and neurodegenerative disorders. One of the predominant biomarkers in neuroimaging is represented by brain age, which has been shown to be a good indicator for different conditions, such as Alzheimer’s Disease. Using brain age for weakly supervised pre-training of DL models in transfer learning settings has also recently shown promising results, especially when dealing with data scarcity of different conditions. On the other hand, anatomical information of brain MRIs (e.g. cortical thickness) can provide important information for learning good representations that can be transferred to many downstream tasks. In this work, we propose AnatCL, an anatomical foundation model for structural brain MRIs that (i.) leverages anatomical information in a weakly contrastive learning approach, and (ii.) achieves state-of-the-art performances across many different downstream tasks. To validate our approach we consider 12 different downstream tasks for the diagnosis of different conditions such as Alzheimer’s Disease, autism spectrum disorder, and schizophrenia. Furthermore, we also target the prediction of 10 different clinical assessment scores using structural MRI data. Our findings show that incorporating anatomical information during pre-training leads to more robust and generalizable representations. Pre-trained models can be found at: https://github.com/EIDOSLAB/AnatCL.
Multimodal Dynamic Cost Matrix Adaptation under data imbalance for multimodal sentiment analysis
IF 3.3, CAS Tier 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-01. Epub Date: 2025-11-06. DOI: 10.1016/j.patrec.2025.11.005
Weiyang Wang, Haoyue Liu, Juan Huang, Bin Xu
Class imbalance in Multimodal Sentiment Analysis (MSA) introduces significant learning bias, causing models to favor majority classes and neglect minority emotions. Existing imbalance-handling methods, including oversampling, undersampling, and cost-sensitive learning, are primarily designed for unimodal tasks and cannot fully exploit cross-modal dependencies. Furthermore, current MSA approaches often employ fixed cost matrices and static loss weighting strategies, which fail to adapt to evolving data distributions, leading to suboptimal feature calibration and poor generalization under real-world conditions. To address these limitations, this paper proposes Multimodal Dynamic Cost Matrix Adaptation (MDCMA), a novel framework that dynamically adjusts class cost matrices to optimize loss allocation and improve minority-class representation. Specifically, MDCMA constructs learnable, modality-specific cost matrices for text and audio via a cross-modal cost matrix parameterization framework, regulated by an adaptive gating network to achieve precise feature space calibration. A sample-level dynamic loss weight balancing mechanism tracks global class statistics in real-time to emphasize minority classes, while a gradient-driven cost matrix optimization algorithm establishes a backpropagation-based feedback loop between cost parameters and classification loss. Experimental results on benchmark datasets demonstrate that MDCMA significantly improves performance under imbalanced conditions, offering a robust and generalizable solution to class imbalance in MSA.
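The core mechanism — tracking global class statistics to re-weight the loss toward minority classes — can be sketched as a cross-entropy whose per-class weights follow inverse running class frequency. This is a simplified, unimodal numpy stand-in for the paper's learnable modality-specific cost matrices and gating network; the momentum update and weight formula are assumptions for illustration.

```python
import numpy as np

class DynamicCostWeightedCE:
    """Cross-entropy with class weights that track inverse running class
    frequency, so minority sentiment classes receive progressively larger
    loss. A simplified stand-in for MDCMA's dynamic cost matrices.
    """
    def __init__(self, n_classes, momentum=0.9):
        self.counts = np.ones(n_classes)   # running (smoothed) class statistics
        self.momentum = momentum
        self.weights = np.ones(n_classes)

    def __call__(self, probs, labels):
        # update running counts from the current batch (exponential moving average)
        batch_counts = np.bincount(labels, minlength=len(self.counts))
        self.counts = self.momentum * self.counts + (1 - self.momentum) * batch_counts
        # inverse-frequency weights, normalised to average ~1
        self.weights = self.counts.sum() / (len(self.counts) * self.counts)
        per_sample = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
        return float((self.weights[labels] * per_sample).mean())
```

Updating the counts with a momentum term rather than raw batch statistics keeps the weights stable across batches, which is one plausible reading of the "sample-level dynamic loss weight balancing" the abstract describes.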