American Sign Language Fingerspelling Recognition in the Wild with Iterative Language Model Construction

APSIPA Transactions on Signal and Information Processing (IF 3.2, Q1 Computer Science) | Pub Date: 2022-01-01 | DOI: 10.1561/116.00000003

W. Kumwilaisak, Peerawat Pannattee, C. Hansakunbuntheung, N. Thatphithakkul


Citation count: 1

Abstract

This paper proposes a novel method to improve the accuracy of American Sign Language (ASL) fingerspelling recognition. Video sequences from the training set of the “ChicagoFSWild” dataset are first used to train a weakly supervised deep neural network, consisting of AlexNet and an LSTM, that automatically generates frame labels from a sequence label. The trained network produces a collection of frame-labeled images from those training video sequences for which the Levenshtein distance between the predicted sequence and the sequence label is zero, i.e., sequences that are predicted exactly. Positive and negative pairs covering all fingerspelling gestures are formed at random from the collected image set. These pairs are used to train a Siamese network, built from ResNet-50 and a projection function, to produce efficient feature representations. The trained ResNet-50 and projection function are then concatenated with a bidirectional LSTM, a fully connected layer, and a softmax layer to form a deep neural network for ASL fingerspelling recognition. Using the training video sequences, the frames of those sequences for which the Levenshtein distance between the predicted sequence and the sequence label is zero are added to the collected image set. The updated image set is used to retrain the Siamese network. The training process, from training the Siamese network to updating the collected image set, is iterated until the image recognition performance no longer improves. Experimental results on the “ChicagoFSWild” dataset show that the proposed method outperforms existing works in terms of character error rate.
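The selection and pairing steps described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation: the function names (`levenshtein`, `select_exact_sequences`, `form_pairs`, `character_error_rate`) are hypothetical, sequences are represented as plain strings, and the character error rate uses a common definition (total edit distance divided by total label length) that the abstract itself does not spell out.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def select_exact_sequences(predictions, labels):
    """Indices of sequences whose prediction has distance zero to the label."""
    return [i for i, (p, t) in enumerate(zip(predictions, labels))
            if levenshtein(p, t) == 0]


def form_pairs(frame_labels, num_pairs, seed=0):
    """Randomly draw (i, j, is_positive) pairs from frame-labeled images.

    A pair is positive when both frames carry the same letter label.
    Uniform sampling is an assumption; the abstract only says "randomly".
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        i, j = rng.sample(range(len(frame_labels)), 2)  # two distinct frames
        pairs.append((i, j, frame_labels[i] == frame_labels[j]))
    return pairs


def character_error_rate(predictions, labels):
    """Assumed definition: total edit distance / total label length."""
    errors = sum(levenshtein(p, t) for p, t in zip(predictions, labels))
    return errors / sum(len(t) for t in labels)


if __name__ == "__main__":
    preds = ["abc", "abd", "xy"]
    truth = ["abc", "abc", "xy"]
    print(select_exact_sequences(preds, truth))  # sequences kept for the image set
    print(character_error_rate(preds, truth))
    print(form_pairs(["a", "a", "b", "b"], num_pairs=3))
```

In the iterative scheme, a filter like `select_exact_sequences` would be rerun after each round of training, so that frames from newly exact sequences join the image set before the next round of pair formation.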


