Authors: Xiaoyan Zhang; Zibin Zhu; Hong Xie; Sisi Ren; Jianmin Jiang
Journal: IEEE Transactions on Visualization and Computer Graphics, volume 31 (10), pages 6719-6736
Publication date: 2025-01-10
Publication type: Journal Article
DOI: 10.1109/TVCG.2025.3528021
URL: https://ieeexplore.ieee.org/document/10836818/
VAT: Visibility Aware Transformer for Fine-Grained Clothed Human Reconstruction
To reconstruct 3D clothed humans with accurate fine-grained details from sparse views, we propose a deeply cooperating two-level, global-to-fine-grained reconstruction framework that constructs robust global geometry to guide fine-grained geometry learning. The core of the framework is a novel Visibility Aware Transformer (VAT), which bridges the two-level reconstruction architecture by connecting its global encoder and fine-grained decoder to two pixel-aligned implicit functions, respectively. The global encoder fuses semantic features from multiple views to integrate global geometric features. In the fine-grained decoder, a visibility-aware attention mechanism efficiently fuses multi-view and multi-scale features to mine fine-grained geometric features. A global embedding module connects the global encoder and the fine-grained decoder so that the two levels cooperate closely: it supplies the global geometric embedding as query guidance for computing visibility-aware attention in the fine-grained decoder. In addition, to extract highly aligned multi-scale features for the two-level reconstruction architecture, we design an image feature extractor, MSUNet, which establishes strong semantic connections between different scales at minimal cost. The proposed framework is end-to-end trainable, with all modules jointly optimized. We validate the framework on public benchmarks, and experimental results show that our method has significant advantages over state-of-the-art methods in both fine-grained performance and generalization.
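The abstract does not state the attention formula, so the following is only a rough sketch of how a global embedding could act as the query in a visibility-weighted cross-view attention. It assumes a generic scaled dot-product attention with a log-visibility bias, which may differ from the paper's design. The function name `visibility_aware_attention`, the tensor shapes, and the per-view visibility scores are all hypothetical and are not taken from the paper.

```python
import numpy as np

def visibility_aware_attention(query, view_feats, visibility):
    """Illustrative masked cross-view attention (not the paper's exact method).

    query:      (P, D) hypothetical global embedding, one per 3D query point.
    view_feats: (P, V, D) hypothetical pixel-aligned feature of each point in V views.
    visibility: (P, V) scores in [0, 1]; 0 means the point is occluded in that view.
    Returns the fused (P, D) features and the (P, V) attention weights.
    """
    P, V, D = view_feats.shape
    # Scaled dot-product score between each point's query and its per-view feature.
    scores = np.einsum("pd,pvd->pv", query, view_feats) / np.sqrt(D)
    # Add log-visibility so occluded views receive a near-zero weight.
    # Clipping avoids log(0); if every view is occluded the bias is uniform
    # and the result falls back to plain attention.
    scores = scores + np.log(np.clip(visibility, 1e-9, 1.0))
    # Numerically stable softmax over the view axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted sum of per-view features for each point.
    fused = np.einsum("pv,pvd->pd", weights, view_feats)
    return fused, weights

# Example: 2 query points, 3 views, 4-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 3, 4))
vis = np.array([[1.0, 0.0, 1.0],   # point 0 is occluded in view 1
                [1.0, 1.0, 1.0]])  # point 1 is visible everywhere
fused, weights = visibility_aware_attention(np.zeros((2, 4)), feats, vis)
```

With a zero query, as in the example, the weights depend only on visibility: the occluded view is suppressed and the remaining views share the weight equally. In a trained model the query would come from the global embedding module, so views whose features agree with the global geometry would be favored among the visible ones.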