LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

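Frontiers in Neurorobotics, Volume 18, Article 1457843; IF 3.5; Zone 4 (Computer Science); JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2024-10-31; eCollection Date: 2024-01-01; DOI: 10.3389/fnbot.2024.1457843
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11560894/pdf/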

Dong Chen, Peisong Wu, Mingdong Chen, Mengtao Wu, Tao Zhang, Chuanqi Li

Citations: 0

Abstract

In recent years, a growing number of researchers have focused on temporal modeling. Transformer-based methods have notably advanced 2D image-based vision tasks. For 3D video tasks such as action recognition, however, applying transformers directly to video data significantly increases both computational and memory demands, because the number of patches multiplies and the self-attention computation grows more complex. Building efficient and accurate 3D self-attention models for video therefore remains a major challenge for transformers. In our research, we introduce a Long and Short-term Temporal Difference Vision Transformer (LS-VIT). The method incorporates short-term motion details into images by weighting the differences across several consecutive frames, thereby equipping the original image with the ability to model short-term motion. Concurrently, we integrate a module designed to capture long-term motion details. This module enhances the model's capacity for long-term motion modeling by directly integrating temporal differences from various segments via motion excitation. Our analysis confirms that LS-VIT achieves high recognition accuracy across multiple benchmarks (e.g., UCF101, HMDB51, Kinetics-400). These results indicate that LS-VIT has potential for further optimization, which could improve real-time performance and action prediction capabilities.
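The short-term idea described in the abstract (weighting the differences across consecutive frames and folding them back into an image) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the differences are weighted or fused, so the function name `motion_augmented_frame`, the use of the middle frame as the "original image", the additive fusion, and the example weights are all assumptions made for illustration. Frames are flattened to plain lists of pixel intensities to keep the example dependency-free.

```python
from typing import List, Sequence

Frame = List[float]  # one frame, flattened to a list of pixel intensities


def motion_augmented_frame(window: Sequence[Frame],
                           weights: Sequence[float]) -> Frame:
    """Add a weighted sum of consecutive-frame differences to the middle frame.

    Hypothetical helper, not taken from the paper.
    window  : n consecutive frames of equal length (n >= 2)
    weights : n - 1 weights, one per difference window[k + 1] - window[k]
    """
    if len(window) < 2:
        raise ValueError("need at least two frames to form a difference")
    if len(weights) != len(window) - 1:
        raise ValueError("expected one weight per consecutive-frame difference")

    # Assumption: the middle frame plays the role of the "original image".
    out = list(window[len(window) // 2])

    # Accumulate w_k * (next - prev) for every consecutive pair of frames.
    for w, (prev, nxt) in zip(weights, zip(window, window[1:])):
        for i, (a, b) in enumerate(zip(prev, nxt)):
            out[i] += w * (b - a)
    return out


# A bright pixel appears at index 1, then moves to index 2.
window = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
augmented = motion_augmented_frame(window, weights=[0.5, 0.5])
print(augmented)  # [0.0, 1.0, 0.5, 0.0]
```

In this toy example the middle frame keeps its own content at index 1 and gains a trace of where the pixel is heading (index 2). A static clip yields all-zero differences and leaves the frame unchanged, which is the property that makes frame differences a cheap motion cue.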

Source journal

Frontiers in Neurorobotics (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; ROBOTICS)

CiteScore: 5.20
Self-citation rate: 6.50%
Articles published: 250
Review time: 14 weeks

About the journal: Frontiers in Neurorobotics publishes rigorously peer-reviewed research in the science and technology of embodied autonomous neural systems. Specialty Chief Editors Alois C. Knoll and Florian Röhrbein at the Technische Universität München are supported by an outstanding Editorial Board of international experts. This multidisciplinary open-access journal is at the forefront of disseminating and communicating scientific knowledge and impactful discoveries to researchers, academics and the public worldwide. Neural systems include brain-inspired algorithms (e.g. connectionist networks), computational models of biological neural networks (e.g. artificial spiking neural nets, large-scale simulations of neural microcircuits) and actual biological systems (e.g. in vivo and in vitro neural nets). The focus of the journal is the embodiment of such neural systems in artificial software and hardware devices, machines, robots or any other form of physical actuation. This also includes prosthetic devices, brain-machine interfaces, wearable systems, micro-machines, furniture, home appliances, as well as systems for managing micro and macro infrastructures. Frontiers in Neurorobotics also aims to publish radically new tools and methods to study plasticity and development of autonomous self-learning systems that are capable of acquiring knowledge in an open-ended manner. Models complemented with experimental studies revealing self-organizing principles of embodied neural systems are welcome. Our journal also publishes on the micro and macro engineering and mechatronics of robotic devices driven by neural systems, as well as studies on the impact that such systems will have on our daily life.