Spatial Temporal Aggregation for Efficient Continuous Sign Language Recognition

IF 5.3 | CAS Zone 3 (Computer Science) | JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE) | IEEE Transactions on Emerging Topics in Computational Intelligence | Pub Date: 2024-04-02 | DOI: 10.1109/TETCI.2024.3378649
Lianyu Hu;Liqing Gao;Zekang Liu;Wei Feng
{"title":"时空聚合实现高效连续手语识别","authors":"Lianyu Hu;Liqing Gao;Zekang Liu;Wei Feng","doi":"10.1109/TETCI.2024.3378649","DOIUrl":null,"url":null,"abstract":"Despite the recent progress of continuous sign language recognition (CSLR), most state-of-the-art methods process input sign language videos frame by frame to predict sentences. This usually causes a heavy computational burden and is inefficient and even infeasible in real-world scenarios. Inspired by the fact that videos are inherently redundant where not all frames are essential for recognition, we propose spatial temporal aggregation (STAgg) to address this problem. Specifically, STAgg synthesizes adjacent similar frames into a unified robust representation before being fed into the recognition module, thus highly reducing the computation complexity and memory demand. We first give a detailed analysis on commonly-used aggregation methods like subsampling, max pooling and average, and then naturally derive our STAgg from the expected design criterion. Compared to commonly used pooling and subsampling counterparts, extensive ablation studies verify the superiority of our proposed three diverse STAgg variants in both accuracy and efficiency. The best version achieves comparative accuracy with state-of-the-art competitors, but is 1.35× faster with only 0.50× computational costs, consuming 0.70× training time and 0.65× memory usage. Experiments on four large-scale datasets upon multiple backbones fully verify the generalizability and effectiveness of the proposed STAgg. Another advantage of STAgg is enabling more powerful backbones, which may further boost the accuracy of CSLR under similar computational/memory budgets. 
We also visualize the results of STAgg to support intuitive and insightful analysis of the effects of STAgg.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"8 6","pages":"3925-3935"},"PeriodicalIF":5.3000,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Spatial Temporal Aggregation for Efficient Continuous Sign Language Recognition\",\"authors\":\"Lianyu Hu;Liqing Gao;Zekang Liu;Wei Feng\",\"doi\":\"10.1109/TETCI.2024.3378649\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Despite the recent progress of continuous sign language recognition (CSLR), most state-of-the-art methods process input sign language videos frame by frame to predict sentences. This usually causes a heavy computational burden and is inefficient and even infeasible in real-world scenarios. Inspired by the fact that videos are inherently redundant where not all frames are essential for recognition, we propose spatial temporal aggregation (STAgg) to address this problem. Specifically, STAgg synthesizes adjacent similar frames into a unified robust representation before being fed into the recognition module, thus highly reducing the computation complexity and memory demand. We first give a detailed analysis on commonly-used aggregation methods like subsampling, max pooling and average, and then naturally derive our STAgg from the expected design criterion. Compared to commonly used pooling and subsampling counterparts, extensive ablation studies verify the superiority of our proposed three diverse STAgg variants in both accuracy and efficiency. The best version achieves comparative accuracy with state-of-the-art competitors, but is 1.35× faster with only 0.50× computational costs, consuming 0.70× training time and 0.65× memory usage. 
Experiments on four large-scale datasets upon multiple backbones fully verify the generalizability and effectiveness of the proposed STAgg. Another advantage of STAgg is enabling more powerful backbones, which may further boost the accuracy of CSLR under similar computational/memory budgets. We also visualize the results of STAgg to support intuitive and insightful analysis of the effects of STAgg.\",\"PeriodicalId\":13135,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"volume\":\"8 6\",\"pages\":\"3925-3935\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2024-04-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10488467/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10488467/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Spatial Temporal Aggregation for Efficient Continuous Sign Language Recognition
Despite the recent progress of continuous sign language recognition (CSLR), most state-of-the-art methods process input sign language videos frame by frame to predict sentences. This usually incurs a heavy computational burden and is inefficient, or even infeasible, in real-world scenarios. Inspired by the fact that videos are inherently redundant and not all frames are essential for recognition, we propose spatial temporal aggregation (STAgg) to address this problem. Specifically, STAgg synthesizes adjacent similar frames into a unified robust representation before feeding them into the recognition module, greatly reducing computational complexity and memory demand. We first give a detailed analysis of commonly used aggregation methods such as subsampling, max pooling, and averaging, and then naturally derive STAgg from the expected design criteria. Compared to the commonly used pooling and subsampling counterparts, extensive ablation studies verify the superiority of our three proposed STAgg variants in both accuracy and efficiency. The best variant achieves accuracy comparable to state-of-the-art competitors, but is 1.35× faster with only 0.50× the computational cost, 0.70× the training time, and 0.65× the memory usage. Experiments on four large-scale datasets with multiple backbones fully verify the generalizability and effectiveness of the proposed STAgg. Another advantage of STAgg is that it enables more powerful backbones, which may further boost the accuracy of CSLR under similar computational/memory budgets. We also visualize the results of STAgg to support intuitive and insightful analysis of its effects.
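The abstract compares STAgg against the standard temporal aggregation baselines: subsampling, max pooling, and averaging over windows of adjacent frames. The sketch below illustrates only those baselines on a NumPy frame tensor; the paper's actual STAgg module is a learned aggregation not specified here, so the function names and window size `k` are illustrative assumptions.

```python
import numpy as np

def subsample(frames, k=2):
    # Keep every k-th frame: the cheapest aggregation, but it
    # simply discards the information in the skipped frames.
    return frames[::k]

def avg_pool(frames, k=2):
    # Average each non-overlapping window of k adjacent frames
    # into a single representation (trailing remainder dropped).
    t = (len(frames) // k) * k
    return frames[:t].reshape(-1, k, *frames.shape[1:]).mean(axis=1)

def max_pool(frames, k=2):
    # Element-wise max over each non-overlapping window of k frames.
    t = (len(frames) // k) * k
    return frames[:t].reshape(-1, k, *frames.shape[1:]).max(axis=1)

# Toy video: 16 frames of 224x224 RGB (T x H x W x C).
video = np.random.rand(16, 224, 224, 3)
print(subsample(video).shape)  # (8, 224, 224, 3)
print(avg_pool(video).shape)   # (8, 224, 224, 3)
print(max_pool(video).shape)   # (8, 224, 224, 3)
```

All three halve the temporal length with `k=2`, which is where the reported compute and memory savings come from; the paper's point is that how the window is collapsed matters for accuracy.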
Source journal: IEEE Transactions on Emerging Topics in Computational Intelligence
CiteScore: 10.30
Self-citation rate: 7.50%
Articles per year: 147
About the journal: The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys. TETCI is an electronics-only publication and publishes six issues per year. Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few illustrative examples are glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for the IoT and Smart-X technologies.