Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement

IF 3.4 · Zone 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · IEEE Access · Pub Date: 2025-01-06 · DOI: 10.1109/ACCESS.2025.3526330

Zhen Wang;Dongyuan Li;Renhe Jiang;Manabu Okumura

IEEE Access, vol. 13, pp. 5491-5506, 2025. Journal Article. https://ieeexplore.ieee.org/document/10829616/

Citations: 0

Abstract

Continuous Sign Language Recognition (CSLR) seeks to interpret the gestures used by hard-of-hearing and mute individuals and translate them into natural language, thereby enhancing communication and interaction. A successful CSLR method relies on continuous tracking of the presenter’s gestures and facial movements. Existing CSLR methods struggle to fully leverage fine-grained continuous frame information and often overlook the importance of multi-scale feature integration during decoding. To address these issues, we propose a spatial-temporal feature-enhanced network, called STNet, for the CSLR task. First, to better exploit continuous frame information, we propose a spatial resonance module based on the optimal transport algorithm, which extracts the global common spatial features of each pair of adjacent frames along the frame sequence. Second, we design a frame-wise loss to preserve and enhance the specific features of each frame. Lastly, to emphasize multi-scale feature fusion on the decoder side, we design a multi-temporal perception module that allows each frame to focus on a larger range of other frames and enhances information interaction across different scales. Extensive experiments on three benchmark datasets, PHOENIX14, PHOENIX14-T, and CSL-Daily, demonstrate that STNet consistently outperforms state-of-the-art methods, with a notable improvement of 2.9% in CSLR, showcasing its effectiveness and generalizability. Our approach provides a robust foundation for real-world applications such as sign language education and communication tools, while ablation and case studies highlight the impact of each module, paving the way for future research in CSLR.
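The abstract does not describe how the spatial resonance module is built, so the following is only a minimal sketch of the general idea it names: using optimal transport to match the features of two adjacent frames. It assumes entropy-regularised transport (Sinkhorn iterations) with uniform marginals and a squared Euclidean cost. The function names `cost_matrix`, `sinkhorn`, and `common_features`, the parameter `eps`, and the final averaging step are all hypothetical choices made for illustration, not the authors' method.

```python
import math


def cost_matrix(a, b):
    """Squared Euclidean distance between every feature in a and every feature in b."""
    return [[sum((x - y) ** 2 for x, y in zip(u, v)) for v in b] for u in a]


def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropy-regularised optimal transport between two uniform distributions.

    Returns a transport plan whose rows sum to roughly 1/n and whose
    columns sum to 1/m. Assumes small costs: very large cost/eps ratios
    would underflow exp() and need a log-domain variant instead.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    row_mass, col_mass = 1.0 / n, 1.0 / m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def common_features(frame_t, frame_t1, eps=0.5):
    """Average each feature of frame t with its transport-weighted match in frame t+1.

    The averaging is an assumption made for this sketch only.
    """
    plan = sinkhorn(cost_matrix(frame_t, frame_t1), eps)
    n, dim = len(frame_t), len(frame_t[0])
    blended = []
    for i, feat in enumerate(frame_t):
        # Row i carries about 1/n of the mass, so scaling by n gives weights near 1.
        matched = [
            n * sum(plan[i][j] * frame_t1[j][d] for j in range(len(frame_t1)))
            for d in range(dim)
        ]
        blended.append([(a + b) / 2 for a, b in zip(feat, matched)])
    return blended


# Toy example: the second frame holds the same two features in swapped order.
frame_t = [[0.0, 0.0], [1.0, 1.0]]
frame_t1 = [[1.0, 1.0], [0.0, 0.0]]
plan = sinkhorn(cost_matrix(frame_t, frame_t1))
blended = common_features(frame_t, frame_t1)
```

In this toy case the plan places most of its mass on the swapped pairs, so each feature in `frame_t` is blended with its nearly identical counterpart in `frame_t1` rather than with the feature at the same index.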


Source journal

IEEE Access (COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 9.80

Self-citation rate: 7.70%

Articles published: 6673

Review time: 6 weeks

Journal description: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.