Temporal superimposed crossover module for effective continuous sign language

IF 2.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Machine Vision and Applications Pub Date : 2024-08-19 DOI:10.1007/s00138-024-01595-3

Qidan Zhu, Jing Li, Fei Yuan, Quan Gan

{"title":"Temporal superimposed crossover module for effective continuous sign language","authors":"Qidan Zhu, Jing Li, Fei Yuan, Quan Gan","doi":"10.1007/s00138-024-01595-3","DOIUrl":null,"url":null,"abstract":"<p>The ultimate goal of continuous sign language recognition is to facilitate communication between special populations and normal people, which places high demands on the real-time and deployable nature of the model. However, researchers have paid little attention to these two properties in previous studies on CSLR. In this paper, we propose a novel CSLR model ResNetT based on temporal superposition crossover module and ResNet, which replaces the parameterized computation with shifts in the temporal dimension and efficiently extracts temporal features without increasing the number of parameters and computation. The ResNetT is able to improve the real-time performance and deployability of the model while ensuring its accuracy. The core is our proposed zero-parameter and zero-computation module TSCM, and we combine TSCM with 2D convolution to form \"TSCM+2D\" hybrid convolution, which provides powerful spatial-temporal modeling capability, zero-parameter increase, and lower deployment cost compared with other spatial-temporal convolutions. Further, we apply \"TSCM+2D\" to ResBlock to form the new ResBlockT, which is the basis of the novel CSLR model ResNetT. We introduce stochastic gradient stops and multilevel connected temporal classification (CTC) loss to train this model, which reduces training memory usage while decreasing the final recognized word error rate (WER) and extends the ResNet network from image classification tasks to video recognition tasks. In addition, this study is the first in the field of CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition learning. Experiments on two large-scale continuous sign language datasets demonstrate the efficiency of the method.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"9 1","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Vision and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00138-024-01595-3","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

The ultimate goal of continuous sign language recognition is to facilitate communication between special populations and normal people, which places high demands on the real-time and deployable nature of the model. However, researchers have paid little attention to these two properties in previous studies on CSLR. In this paper, we propose a novel CSLR model ResNetT based on temporal superposition crossover module and ResNet, which replaces the parameterized computation with shifts in the temporal dimension and efficiently extracts temporal features without increasing the number of parameters and computation. The ResNetT is able to improve the real-time performance and deployability of the model while ensuring its accuracy. The core is our proposed zero-parameter and zero-computation module TSCM, and we combine TSCM with 2D convolution to form "TSCM+2D" hybrid convolution, which provides powerful spatial-temporal modeling capability, zero-parameter increase, and lower deployment cost compared with other spatial-temporal convolutions. Further, we apply "TSCM+2D" to ResBlock to form the new ResBlockT, which is the basis of the novel CSLR model ResNetT. We introduce stochastic gradient stops and multilevel connected temporal classification (CTC) loss to train this model, which reduces training memory usage while decreasing the final recognized word error rate (WER) and extends the ResNet network from image classification tasks to video recognition tasks. In addition, this study is the first in the field of CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition learning. Experiments on two large-scale continuous sign language datasets demonstrate the efficiency of the method.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

有效连续手语的时空叠加交叉模块

连续手语识别的最终目标是促进特殊人群与正常人之间的交流，这就对模型的实时性和可部署性提出了很高的要求。然而，在以往的 CSLR 研究中，研究人员很少关注这两个特性。本文提出了一种基于时空叠加交叉模块和 ResNet 的新型 CSLR 模型 ResNetT，该模型以时空维度的移动取代了参数化计算，在不增加参数和计算量的情况下高效提取时空特征。ResNetT 能够提高模型的实时性能和可部署性，同时确保其准确性。其核心是我们提出的零参数、零计算模块 TSCM，并将 TSCM 与二维卷积相结合，形成 "TSCM+2D "混合卷积，与其他时空卷积相比，具有强大的时空建模能力、零参数增加和更低的部署成本。此外，我们将 "TSCM+2D "应用于 ResBlock，形成新的 ResBlockT，这是新型 CSLR 模型 ResNetT 的基础。我们在训练该模型时引入了随机梯度停止和多级连接时序分类（CTC）损失，从而减少了训练内存的使用，同时降低了最终识别的词错误率（WER），并将 ResNet 网络从图像分类任务扩展到视频识别任务。此外，本研究还是 CSLR 领域首次仅使用二维卷积来提取手语视频的时空特征，从而实现端到端的识别学习。在两个大规模连续手语数据集上的实验证明了该方法的高效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Machine Vision and Applications 工程技术-工程：电子与电气

CiteScore

6.30

自引率

3.00%

发文量

审稿时长

8.7 months

期刊介绍： Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submittals in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision, are all within the scope of the journal. Particular emphasis is placed on engineering and technology aspects of image processing and computer vision. The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.