Spatiotemporal Convolutions and Video Vision Transformers for Signer-Independent Sign Language Recognition

IF 2.6 4区计算机科学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Big Data Pub Date : 2023-08-03 DOI:10.1109/icABCD59051.2023.10220534

Mark Marais, Dane Brown, James Connan, Alden Boby

{"title":"Spatiotemporal Convolutions and Video Vision Transformers for Signer-Independent Sign Language Recognition","authors":"Mark Marais, Dane Brown, James Connan, Alden Boby","doi":"10.1109/icABCD59051.2023.10220534","DOIUrl":null,"url":null,"abstract":"Sign language is a vital tool of communication for individuals who are deaf or hard of hearing. Sign language recognition (SLR) technology can assist in bridging the communication gap between deaf and hearing individuals. However, existing SLR systems are typically signer-dependent, requiring training data from the specific signer for accurate recognition. This presents a significant challenge for practical use, as collecting data from every possible signer is not feasible. This research focuses on developing a signer-independent isolated SLR system to address this challenge. The system implements two model variants on the signer-independent datasets: an R(2+ I)D spatiotemporal convolutional block and a Video Vision transformer. These models learn to extract features from raw sign language videos from the LSA64 dataset and classify signs without needing handcrafted features, explicit segmentation or pose estimation. Overall, the R(2+1)D model architecture significantly outperformed the ViViT architecture for signer-independent SLR on the LSA64 dataset. The R(2+1)D model achieved a near-perfect accuracy of 99.53% on the unseen test set, with the ViViT model yielding an accuracy of 72.19 %. Proving that spatiotemporal convolutions are effective at signer-independent SLR.","PeriodicalId":51314,"journal":{"name":"Big Data","volume":"7 1","pages":"1-6"},"PeriodicalIF":2.6000,"publicationDate":"2023-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/icABCD59051.2023.10220534","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

Sign language is a vital tool of communication for individuals who are deaf or hard of hearing. Sign language recognition (SLR) technology can assist in bridging the communication gap between deaf and hearing individuals. However, existing SLR systems are typically signer-dependent, requiring training data from the specific signer for accurate recognition. This presents a significant challenge for practical use, as collecting data from every possible signer is not feasible. This research focuses on developing a signer-independent isolated SLR system to address this challenge. The system implements two model variants on the signer-independent datasets: an R(2+ I)D spatiotemporal convolutional block and a Video Vision transformer. These models learn to extract features from raw sign language videos from the LSA64 dataset and classify signs without needing handcrafted features, explicit segmentation or pose estimation. Overall, the R(2+1)D model architecture significantly outperformed the ViViT architecture for signer-independent SLR on the LSA64 dataset. The R(2+1)D model achieved a near-perfect accuracy of 99.53% on the unseen test set, with the ViViT model yielding an accuracy of 72.19 %. Proving that spatiotemporal convolutions are effective at signer-independent SLR.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于时空卷积和视频视觉变换的独立手语识别

手语是聋人或重听人交流的重要工具。手语识别(SLR)技术可以帮助弥合聋人与正常人之间的沟通差距。然而，现有的单反系统通常依赖于签名者，需要来自特定签名者的训练数据才能准确识别。这对实际使用提出了重大挑战，因为从每个可能的签名者那里收集数据是不可行的。本研究的重点是开发一个独立于签名者的隔离单反系统来解决这一挑战。该系统在签名无关的数据集上实现了两种模型变体:R(2+ I)D时空卷积块和视频视觉变压器。这些模型学习从LSA64数据集的原始手语视频中提取特征，并在不需要手工制作特征、显式分割或姿势估计的情况下对标志进行分类。总体而言，在LSA64数据集上，R(2+1)D模型体系结构在与签名者无关的SLR方面明显优于ViViT体系结构。R(2+1)D模型在未见的测试集上实现了近乎完美的99.53%的准确率，而ViViT模型的准确率为72.19%。证明了时空卷积在与签名无关的单反中是有效的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Big Data COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS-COMPUTER SCIENCE, THEORY & METHODS

CiteScore

9.10

自引率

2.20%

发文量

期刊介绍： Big Data is the leading peer-reviewed journal covering the challenges and opportunities in collecting, analyzing, and disseminating vast amounts of data. The Journal addresses questions surrounding this powerful and growing field of data science and facilitates the efforts of researchers, business managers, analysts, developers, data scientists, physicists, statisticians, infrastructure developers, academics, and policymakers to improve operations, profitability, and communications within their businesses and institutions. Spanning a broad array of disciplines focusing on novel big data technologies, policies, and innovations, the Journal brings together the community to address current challenges and enforce effective efforts to organize, store, disseminate, protect, manipulate, and, most importantly, find the most effective strategies to make this incredible amount of information work to benefit society, industry, academia, and government. Big Data coverage includes: Big data industry standards, New technologies being developed specifically for big data, Data acquisition, cleaning, distribution, and best practices, Data protection, privacy, and policy, Business interests from research to product, The changing role of business intelligence, Visualization and design principles of big data infrastructures, Physical interfaces and robotics, Social networking advantages for Facebook, Twitter, Amazon, Google, etc, Opportunities around big data and how companies can harness it to their advantage.