Title: Spatial Temporal Aggregation for Efficient Continuous Sign Language Recognition
Authors: Lianyu Hu; Liqing Gao; Zekang Liu; Wei Feng
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 8, no. 6, pp. 3925-3935
DOI: 10.1109/TETCI.2024.3378649
Publication date: 2024-04-02
Impact factor: 5.3; JCR: Q1 (Computer Science, Artificial Intelligence)
URL: https://ieeexplore.ieee.org/document/10488467/
Citations: 0
Abstract
Despite recent progress in continuous sign language recognition (CSLR), most state-of-the-art methods process input sign language videos frame by frame to predict sentences. This imposes a heavy computational burden and is inefficient, or even infeasible, in real-world scenarios. Motivated by the fact that videos are inherently redundant and not all frames are essential for recognition, we propose spatial temporal aggregation (STAgg) to address this problem. Specifically, STAgg synthesizes adjacent similar frames into a unified, robust representation before feeding them into the recognition module, thus greatly reducing computational complexity and memory demand. We first give a detailed analysis of commonly used aggregation methods such as subsampling, max pooling, and averaging, and then derive STAgg naturally from the resulting design criteria. Compared to these pooling and subsampling counterparts, extensive ablation studies verify the superiority of our three proposed STAgg variants in both accuracy and efficiency. The best version achieves accuracy comparable to state-of-the-art competitors, but is 1.35× faster with only 0.50× the computational cost, consuming 0.70× the training time and 0.65× the memory. Experiments on four large-scale datasets across multiple backbones fully verify the generalizability and effectiveness of the proposed STAgg. Another advantage of STAgg is that it enables more powerful backbones, which may further boost CSLR accuracy under similar computational/memory budgets. We also visualize the results of STAgg to support intuitive and insightful analysis of its effects.
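The baseline aggregation methods that the abstract analyzes (subsampling, max pooling, and averaging over adjacent frames) can be sketched as follows. This is a minimal illustration of those baselines, not of STAgg itself, whose learned aggregation is not specified in the abstract; the function name, window size, and padding strategy are illustrative assumptions.

```python
import numpy as np

def aggregate_frames(frames, method="average", window=2):
    """Collapse each non-overlapping window of adjacent frames into one.

    frames: array of shape (T, ...) -- T video frames or frame features.
    Returns an array of shape (ceil(T / window), ...).
    Note: names and padding choice are illustrative, not from the paper.
    """
    T = frames.shape[0]
    pad = (-T) % window
    if pad:
        # Repeat the last frame so T divides evenly into windows.
        frames = np.concatenate([frames, np.repeat(frames[-1:], pad, axis=0)])
    grouped = frames.reshape(-1, window, *frames.shape[1:])
    if method == "subsample":
        return grouped[:, 0]         # keep the first frame of each window
    if method == "max":
        return grouped.max(axis=1)   # element-wise max pooling over the window
    if method == "average":
        return grouped.mean(axis=1)  # element-wise mean over the window
    raise ValueError(f"unknown method: {method}")

# Example: 100 RGB frames reduced to 50 aggregated frames.
video = np.random.rand(100, 224, 224, 3)
out = aggregate_frames(video, method="average", window=2)
print(out.shape)  # (50, 224, 224, 3)
```

With `window=2`, the temporal length is halved before the recognition module, which is the source of the compute and memory savings the abstract quantifies.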
About the journal:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication and publishes six issues per year.
Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. Illustrative examples include glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for IoT and Smart-X technologies.