{"title":"有效连续手语的时空叠加交叉模块","authors":"Qidan Zhu, Jing Li, Fei Yuan, Quan Gan","doi":"10.1007/s00138-024-01595-3","DOIUrl":null,"url":null,"abstract":"<p>The ultimate goal of continuous sign language recognition is to facilitate communication between special populations and normal people, which places high demands on the real-time and deployable nature of the model. However, researchers have paid little attention to these two properties in previous studies on CSLR. In this paper, we propose a novel CSLR model ResNetT based on temporal superposition crossover module and ResNet, which replaces the parameterized computation with shifts in the temporal dimension and efficiently extracts temporal features without increasing the number of parameters and computation. The ResNetT is able to improve the real-time performance and deployability of the model while ensuring its accuracy. The core is our proposed zero-parameter and zero-computation module TSCM, and we combine TSCM with 2D convolution to form \"TSCM+2D\" hybrid convolution, which provides powerful spatial-temporal modeling capability, zero-parameter increase, and lower deployment cost compared with other spatial-temporal convolutions. Further, we apply \"TSCM+2D\" to ResBlock to form the new ResBlockT, which is the basis of the novel CSLR model ResNetT. We introduce stochastic gradient stops and multilevel connected temporal classification (CTC) loss to train this model, which reduces training memory usage while decreasing the final recognized word error rate (WER) and extends the ResNet network from image classification tasks to video recognition tasks. In addition, this study is the first in the field of CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition learning. Experiments on two large-scale continuous sign language datasets demonstrate the efficiency of the method.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"9 1","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Temporal superimposed crossover module for effective continuous sign language\",\"authors\":\"Qidan Zhu, Jing Li, Fei Yuan, Quan Gan\",\"doi\":\"10.1007/s00138-024-01595-3\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The ultimate goal of continuous sign language recognition is to facilitate communication between special populations and normal people, which places high demands on the real-time and deployable nature of the model. However, researchers have paid little attention to these two properties in previous studies on CSLR. In this paper, we propose a novel CSLR model ResNetT based on temporal superposition crossover module and ResNet, which replaces the parameterized computation with shifts in the temporal dimension and efficiently extracts temporal features without increasing the number of parameters and computation. The ResNetT is able to improve the real-time performance and deployability of the model while ensuring its accuracy. The core is our proposed zero-parameter and zero-computation module TSCM, and we combine TSCM with 2D convolution to form \\\"TSCM+2D\\\" hybrid convolution, which provides powerful spatial-temporal modeling capability, zero-parameter increase, and lower deployment cost compared with other spatial-temporal convolutions. Further, we apply \\\"TSCM+2D\\\" to ResBlock to form the new ResBlockT, which is the basis of the novel CSLR model ResNetT. We introduce stochastic gradient stops and multilevel connected temporal classification (CTC) loss to train this model, which reduces training memory usage while decreasing the final recognized word error rate (WER) and extends the ResNet network from image classification tasks to video recognition tasks. In addition, this study is the first in the field of CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition learning. Experiments on two large-scale continuous sign language datasets demonstrate the efficiency of the method.</p>\",\"PeriodicalId\":51116,\"journal\":{\"name\":\"Machine Vision and Applications\",\"volume\":\"9 1\",\"pages\":\"\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine Vision and Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s00138-024-01595-3\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Vision and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00138-024-01595-3","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Temporal superimposed crossover module for effective continuous sign language
The ultimate goal of continuous sign language recognition is to facilitate communication between special populations and normal people, which places high demands on the real-time and deployable nature of the model. However, researchers have paid little attention to these two properties in previous studies on CSLR. In this paper, we propose a novel CSLR model ResNetT based on temporal superposition crossover module and ResNet, which replaces the parameterized computation with shifts in the temporal dimension and efficiently extracts temporal features without increasing the number of parameters and computation. The ResNetT is able to improve the real-time performance and deployability of the model while ensuring its accuracy. The core is our proposed zero-parameter and zero-computation module TSCM, and we combine TSCM with 2D convolution to form "TSCM+2D" hybrid convolution, which provides powerful spatial-temporal modeling capability, zero-parameter increase, and lower deployment cost compared with other spatial-temporal convolutions. Further, we apply "TSCM+2D" to ResBlock to form the new ResBlockT, which is the basis of the novel CSLR model ResNetT. We introduce stochastic gradient stops and multilevel connected temporal classification (CTC) loss to train this model, which reduces training memory usage while decreasing the final recognized word error rate (WER) and extends the ResNet network from image classification tasks to video recognition tasks. In addition, this study is the first in the field of CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition learning. Experiments on two large-scale continuous sign language datasets demonstrate the efficiency of the method.
期刊介绍:
Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submittals in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision, are all within the scope of the journal.
Particular emphasis is placed on engineering and technology aspects of image processing and computer vision.
The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.