A Unified 3D Human Motion Synthesis Model via Conditional Variational Auto-Encoder*

2021 IEEE/CVF International Conference on Computer Vision (ICCV). Pub Date: 2021-10-01. DOI: 10.1109/ICCV48922.2021.01144

Yujun Cai, Yiwei Wang, Yiheng Zhu, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Chuanxia Zheng, Sijie Yan, Henghui Ding, Xiaohui Shen, Ding Liu, N. Thalmann

Pages: 11625-11635

Citations: 44

Abstract

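We present a unified and flexible framework for the generalized problem of 3D motion synthesis, which covers motion prediction, completion, interpolation, and spatial-temporal recovery. Because these tasks have different input constraints and different fidelity and diversity requirements, most existing approaches either cater to a single task or use different architectures for different tasks. Here we propose a unified framework based on a Conditional Variational Auto-Encoder (CVAE), in which any arbitrary input is treated as a masked motion series. Notably, by treating the problem as a conditional generation process, we estimate a parametric distribution of the missing regions from the input conditions, and then sample from it to synthesize the full motion series. To further allow flexible manipulation of the motion style of the generated series, we design an Action-Adaptive Modulation (AAM) that propagates the given semantic guidance through the whole sequence. We also introduce a cross-attention mechanism that exploits distant relations among decoder and encoder features for better realism and global consistency. We conducted extensive experiments on Human 3.6M and CMU-Mocap. The results show that our method produces coherent and realistic results for various motion synthesis tasks, with the synthesized motions distinctly adapted by the given action labels.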
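
The idea of treating every task as a masked motion series can be illustrated with a small NumPy sketch. This is not the authors' code: the function names `make_mask` and `mask_motion`, the (frames, joints, 3) array layout, the parameters `n_obs`, `n_key`, and `drop`, and the exact mask pattern chosen for each task are all assumptions made for illustration. The abstract only states that arbitrary inputs are treated as masked series.

```python
import numpy as np

def make_mask(task, T, J, rng=None, n_obs=10, n_key=5, drop=0.3):
    """Return a (T, J) mask: 1 = observed, 0 = missing (to be synthesized).

    The per-task patterns below are illustrative assumptions, not the
    definitions used in the paper.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    mask = np.zeros((T, J), dtype=np.float32)
    if task == "prediction":
        # Observe the first n_obs frames; every later frame is missing.
        mask[:n_obs] = 1.0
    elif task == "completion":
        # Observe the start and the end; the middle segment is missing.
        mask[:n_obs] = 1.0
        mask[-n_obs:] = 1.0
    elif task == "interpolation":
        # Observe n_key evenly spaced keyframes only.
        keys = np.linspace(0, T - 1, n_key).round().astype(int)
        mask[keys] = 1.0
    elif task == "recovery":
        # Spatial-temporal recovery: drop random (frame, joint) entries.
        mask[:] = (rng.random((T, J)) >= drop).astype(np.float32)
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

def mask_motion(motion, mask):
    """Zero out the missing entries of a (T, J, 3) motion array."""
    return motion * mask[..., None]

if __name__ == "__main__":
    T, J = 50, 17  # hypothetical sequence length and joint count
    motion = np.ones((T, J, 3), dtype=np.float32)
    for task in ("prediction", "completion", "interpolation", "recovery"):
        m = make_mask(task, T, J)
        masked = mask_motion(motion, m)
        print(task, masked.shape, int(m.sum()), "observed entries")
```

Under this framing a single conditional generator only ever sees a partially observed sequence together with its mask, so switching tasks changes the mask rather than the architecture.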
