EV-TIFNet: lightweight binocular fusion network assisted by event camera time information for 3D human pose estimation

Journal of Real-Time Image Processing (IF 3.0; Zone 4, Computer Science; JCR Q2, Computer Science, Artificial Intelligence) | Pub Date: 2024-08-09 | DOI: 10.1007/s11554-024-01528-3

Xin Zhao, Lianping Yang, Wencong Huang, Qi Wang, Xin Wang, Yantao Lou



Abstract

Human pose estimation using RGB cameras often degrades in challenging scenarios such as motion blur or suboptimal lighting. In comparison, event cameras, with their wide dynamic range, microsecond-scale temporal resolution, minimal latency, and low power consumption, adapt remarkably well to extreme visual environments. Nevertheless, current research on event-camera pose estimation has not yet fully harnessed the potential of event-driven data, and improving model efficiency remains an open problem. This work focuses on devising an efficient, compact pose estimation algorithm, with special attention to optimizing the fusion of multi-view event streams for improved pose prediction accuracy. We propose EV-TIFNet, a compact dual-view interactive network that incorporates event frames along with our custom-designed Global Spatio-Temporal Feature Maps (GTF Maps). To enhance the network’s ability to understand motion characteristics and localize keypoints, we have tailored a dedicated Auxiliary Information Extraction Module (AIE Module) for the GTF Maps. Experimental results demonstrate that our model, with a compact parameter count of 0.55 million, achieves notable advancements on the DHP19 dataset, reducing the \(\mathrm{MPJPE}_{3D}\) to 61.45 mm. Building on the sparsity of event data, sparse convolution operators replace a significant portion of the traditional convolutional layers, reducing computational demand by 28.3% to a total of 8.71 GFLOPs. These design choices highlight the model’s suitability and efficiency in scenarios where computational resources are limited.
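
For readers unfamiliar with the metric quoted in the abstract, MPJPE (mean per-joint position error) is commonly defined as the average Euclidean distance between predicted and ground-truth joint positions. The sketch below illustrates that standard definition for a single pose; it is not taken from the paper. The function name `mpjpe_3d`, the `Joint` alias, and the toy coordinates are assumptions introduced here for illustration, and the paper's exact evaluation protocol on DHP19 may differ.

```python
import math
from typing import Sequence

# Hypothetical alias: one joint as (x, y, z) coordinates in millimetres.
Joint = Sequence[float]


def mpjpe_3d(predicted: Sequence[Joint], ground_truth: Sequence[Joint]) -> float:
    """Average Euclidean distance between matching joints of one pose."""
    if len(predicted) != len(ground_truth) or len(predicted) == 0:
        raise ValueError("poses must be non-empty and have the same joint count")
    # math.dist computes the Euclidean distance between two points.
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Toy pose with two joints (illustrative values, not data from the paper).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe_3d(pred, truth))  # 2.5
```

In this toy case the two joints are off by 0 mm and 5 mm, so the mean error is 2.5 mm. A dataset-level score would typically average this per-pose value over all evaluated frames.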




Source journal

Journal of Real-Time Image Processing (Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic)

CiteScore

6.80

Self-citation rate

6.70%

Articles published

Review time

6 months

Journal description: Due to rapid advancements in integrated circuit technology, the rich theoretical results that have been developed by the image and video processing research community are now being increasingly applied in practical systems to solve real-world image and video processing problems. Such systems involve constraints placed not only on their size, cost, and power consumption, but also on the timeliness of the image data processed. Examples of such systems are mobile phones, digital still/video/cell-phone cameras, portable media players, personal digital assistants, high-definition television, video surveillance systems, industrial visual inspection systems, medical imaging devices, vision-guided autonomous robots, spectral imaging systems, and many other real-time embedded systems. In these real-time systems, strict timing requirements demand that results are available within a certain interval of time as imposed by the application. It is often the case that an image processing algorithm is developed and proven theoretically sound, presumably with a specific application in mind, but its practical applications and the detailed steps, methodology, and trade-off analysis required to achieve its real-time performance are not fully explored, leaving these critical and usually non-trivial issues for those wishing to employ the algorithm in a real-time system. The Journal of Real-Time Image Processing is intended to bridge the gap between the theory and practice of image processing, serving the greater community of researchers, practicing engineers, and industrial professionals who deal with designing, implementing or utilizing image processing systems which must satisfy real-time design constraints.