Deep multimodal-based finger spelling recognition for Thai sign language: a new benchmark and model composition

IF 2.4 · CAS Quartile 4 (Computer Science) · JCR Q3 (Computer Science, Artificial Intelligence) · Machine Vision and Applications · Pub Date: 2024-05-31 · DOI: 10.1007/s00138-024-01557-9
Wuttichai Vijitkunsawat, Teeradaj Racharak, Minh Le Nguyen
Citations: 0

Abstract

Video-based sign language recognition is vital for improving communication for the deaf and hard of hearing. Creating and maintaining high-quality Thai sign language video datasets is challenging due to a lack of resources. Tackling this issue, we rigorously investigate the design and development of a deep learning-based system for Thai Finger Spelling recognition, assessing various models on a new dataset of 90 standard letters performed by 43 diverse signers. We investigate seven deep learning models spanning three distinct modalities: video-only methods (including RGB-sequencing-based CNN-LSTM and VGG-LSTM), human body joint coordinate sequences (processed by LSTM, BiLSTM, GRU, and Transformer models), and skeleton analysis (using TGCN with a graph-structured skeleton representation). A thorough assessment of these models is conducted across seven circumstances, encompassing single-hand postures; single-hand motions with one, two, and three strokes; and two-hand postures with both static and dynamic point-on-hand interactions. The research highlights that the TGCN model is the optimal lightweight model in all scenarios. A combination of the Transformer and TGCN models across two modalities delivers outstanding performance in four particular conditions: single-hand poses and single-hand poses requiring one, two, and three strokes. In contrast, two-hand poses with static or dynamic point-on-hand interactions present substantial challenges: hand occlusions leave the joint-coordinate sequences incomplete and the skeletal graph structure insufficiently detailed. The study recommends integrating RGB-sequencing with the other modalities to enhance the accuracy of two-handed sign language gestures.
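The best single-hand results come from composing the Transformer (joint-coordinate modality) with the TGCN (skeleton modality). The abstract does not specify the fusion rule; a minimal sketch of one common way such a two-modality composition can work is late fusion, averaging each branch's per-class softmax scores. The weighting, class subset, and logit values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(transformer_logits, tgcn_logits, w=0.5):
    """Blend two modality branches' class distributions.

    w weighs the Transformer branch; (1 - w) weighs the TGCN branch.
    Returns a fused probability vector over the letter classes.
    """
    p_transformer = softmax(np.asarray(transformer_logits, dtype=float))
    p_tgcn = softmax(np.asarray(tgcn_logits, dtype=float))
    return w * p_transformer + (1 - w) * p_tgcn

# Toy logits for a hypothetical 5-letter subset of the 90-letter task.
transformer_logits = np.array([2.0, 0.1, 0.5, -1.0, 0.3])
tgcn_logits = np.array([1.5, 0.2, 1.8, -0.5, 0.0])

fused = late_fusion(transformer_logits, tgcn_logits)
predicted_class = int(fused.argmax())  # index of the fused top-scoring letter
```

Because each branch's scores are normalized before averaging, neither modality can dominate simply by producing larger raw logits; the weight `w` would typically be tuned on validation data.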


Source journal

Machine Vision and Applications (Engineering & Technology — Engineering: Electrical & Electronic)
CiteScore: 6.30
Self-citation rate: 3.00%
Articles published: 84
Review time: 8.7 months
Journal description: Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submissions in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision are all within the scope of the journal. Particular emphasis is placed on engineering and technology aspects of image processing and computer vision. The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.
Latest articles from this journal:

- A novel key point based ROI segmentation and image captioning using guidance information
- Specular Surface Detection with Deep Static Specular Flow and Highlight
- Removing cloud shadows from ground-based solar imagery
- Underwater image object detection based on multi-scale feature fusion
- Object Recognition Consistency in Regression for Active Detection