TransWild: Enhancing 3D interacting hands recovery in the wild with IoU-guided Transformer
Authors: Wanru Zhu, Yichen Zhang, Ke Chen, Lihua Guo
Journal: Image and Vision Computing, vol. 152, Article 105316 (impact factor 4.2; JCR Q2, Computer Science, Artificial Intelligence)
Publication date: 2024-11-12
DOI: 10.1016/j.imavis.2024.105316
URL: https://www.sciencedirect.com/science/article/pii/S0262885624004219
Citation count: 0
Abstract
Recovering 3D interacting hand meshes in the wild (ITW) is crucial for 3D full-body mesh reconstruction, especially when only limited 3D annotations are available. The recent ITW interacting-hand recovery method brings the two hands into a shared 2D scale space and learns effectively from ITW datasets. However, it does not deeply exploit the intrinsic interaction dynamics of the hands. In this work, we propose TransWild, a novel framework for 3D interacting hand mesh recovery that leverages a weight-shared Intersection-over-Union (IoU) guided Transformer for feature interaction. Building on the harmonization of ITW and MoCap datasets within a unified 2D scale space, our hand feature interaction mechanism, powered by the IoU-guided Transformer, enables more accurate estimation of interacting hands. This design stems from the observation that hand detection already yields a valuable cue, the IoU of the two hands' bounding boxes; guiding the Transformer with this IoU can therefore significantly enrich its ability to decode this cue and integrate it into the interacting hand recovery process. To ensure consistent training outcomes, we have developed a strategy of training with augmented ground-truth bounding boxes to address inherent variability. Quantitative evaluations on two prominent benchmarks for 3D interacting hands demonstrate our method's superior performance. The code will be released after acceptance.
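To make the IoU cue concrete, the following is a minimal sketch of how the IoU of two hand bounding boxes is conventionally computed, together with one possible way to perturb a ground-truth box during training. It is not the authors' implementation: the abstract does not specify the box format, the augmentation scheme, or how the IoU enters the Transformer. The `(x_min, y_min, x_max, y_max)` box layout, the function names `box_iou` and `jitter_box`, and the `max_shift` and `max_scale` parameters are all assumptions made for illustration.

```python
import random
from typing import Tuple

# Assumed box layout (not stated in the abstract): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Standard Intersection-over-Union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def jitter_box(box: Box, rng: random.Random,
               max_shift: float = 0.1, max_scale: float = 0.1) -> Box:
    """Hypothetical augmentation: randomly shift and rescale a ground-truth box.

    The shift is a fraction of the box size and the scale is drawn from
    [1 - max_scale, 1 + max_scale]; both ranges are illustrative assumptions.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    cx = (x0 + x1) / 2 + rng.uniform(-max_shift, max_shift) * w
    cy = (y0 + y1) / 2 + rng.uniform(-max_shift, max_shift) * h
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    w, h = w * s, h * s
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


left_hand: Box = (0.0, 0.0, 100.0, 100.0)
right_hand: Box = (50.0, 50.0, 150.0, 150.0)
print(round(box_iou(left_hand, right_hand), 3))  # 2500 / 17500 ≈ 0.143
```

A scalar like this grows as the hands overlap more in the image, which is one reason it is a plausible signal of how strongly the two hands interact; how TransWild actually injects it into the Transformer is described in the paper, not here.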
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.