使用记忆网络作为动态字典从语音中生成不同的手势

2022 International Conference on Culture-Oriented Science and Technology (CoST) Pub Date : 2022-08-01 DOI:10.1109/CoST57098.2022.00042

Zeyu Zhao, Nan Gao, Zhi Zeng, Shuwu Zhanga

{"title":"使用记忆网络作为动态字典从语音中生成不同的手势","authors":"Zeyu Zhao, Nan Gao, Zhi Zeng, Shuwu Zhanga","doi":"10.1109/CoST57098.2022.00042","DOIUrl":null,"url":null,"abstract":"People naturally enhance their speeches with body motion or gestures. Generating human gestures for digital humans or virtual avatars from speech audio or text remains challenging for its indeterministic nature. We observe that existing neural methods often give gestures with an inadequate amount of movement shift, which can be characterized as slow or dull. Thus, we propose a novel generative model coupled with memory networks to work as dynamic dictionaries for generating gestures with improved diversity. Under the hood of the proposed model, a dictionary network dynamically stores previously appeared pose features corresponding to text features for the generator to lookup, while a pose generation network takes in audio and pose features and outputs the resulting gesture sequences. Seed poses are utilized in the generation process to guarantee the continuity between two speech segments. We also propose a new objective evaluation metric for diversity of generated gestures and succeed in demonstrating that the proposed model has the ability to generate gestures with improved diversity.","PeriodicalId":135595,"journal":{"name":"2022 International Conference on Culture-Oriented Science and Technology (CoST)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Generating Diverse Gestures from Speech Using Memory Networks as Dynamic Dictionaries\",\"authors\":\"Zeyu Zhao, Nan Gao, Zhi Zeng, Shuwu Zhanga\",\"doi\":\"10.1109/CoST57098.2022.00042\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"People naturally enhance their speeches with body motion or gestures. Generating human gestures for digital humans or virtual avatars from speech audio or text remains challenging for its indeterministic nature. We observe that existing neural methods often give gestures with an inadequate amount of movement shift, which can be characterized as slow or dull. Thus, we propose a novel generative model coupled with memory networks to work as dynamic dictionaries for generating gestures with improved diversity. Under the hood of the proposed model, a dictionary network dynamically stores previously appeared pose features corresponding to text features for the generator to lookup, while a pose generation network takes in audio and pose features and outputs the resulting gesture sequences. Seed poses are utilized in the generation process to guarantee the continuity between two speech segments. We also propose a new objective evaluation metric for diversity of generated gestures and succeed in demonstrating that the proposed model has the ability to generate gestures with improved diversity.\",\"PeriodicalId\":135595,\"journal\":{\"name\":\"2022 International Conference on Culture-Oriented Science and Technology (CoST)\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Conference on Culture-Oriented Science and Technology (CoST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CoST57098.2022.00042\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Culture-Oriented Science and Technology (CoST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CoST57098.2022.00042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

人们自然会用肢体动作或手势来加强他们的演讲。由于语音、音频或文本的不确定性，为数字人类或虚拟化身生成人类手势仍然具有挑战性。我们观察到，现有的神经方法通常给出的手势的运动位移量不足，这可以表征为缓慢或沉闷。因此，我们提出了一种新的与记忆网络相结合的生成模型，作为动态字典来生成具有改进多样性的手势。在该模型的框架下，字典网络动态存储先前出现的与文本特征对应的姿势特征，供生成器查找，而姿势生成网络接收音频和姿势特征并输出生成的手势序列。在生成过程中利用种子位姿来保证两个语音段之间的连续性。我们还提出了一种新的客观评价指标来生成手势的多样性，并成功地证明了所提出的模型具有生成具有改进多样性的手势的能力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Generating Diverse Gestures from Speech Using Memory Networks as Dynamic Dictionaries

People naturally enhance their speeches with body motion or gestures. Generating human gestures for digital humans or virtual avatars from speech audio or text remains challenging for its indeterministic nature. We observe that existing neural methods often give gestures with an inadequate amount of movement shift, which can be characterized as slow or dull. Thus, we propose a novel generative model coupled with memory networks to work as dynamic dictionaries for generating gestures with improved diversity. Under the hood of the proposed model, a dictionary network dynamically stores previously appeared pose features corresponding to text features for the generator to lookup, while a pose generation network takes in audio and pose features and outputs the resulting gesture sequences. Seed poses are utilized in the generation process to guarantee the continuity between two speech segments. We also propose a new objective evaluation metric for diversity of generated gestures and succeed in demonstrating that the proposed model has the ability to generate gestures with improved diversity.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2022 International Conference on Culture-Oriented Science and Technology (CoST)

自引率

0.00%

发文量