Retrieving Target Gestures Toward Speech Driven Animation with Meaningful Behaviors

Proceedings of the 2015 ACM on International Conference on Multimodal Interaction Pub Date : 2015-11-09 DOI:10.1145/2818346.2820750

Najmeh Sadoughi, C. Busso

{"title":"Retrieving Target Gestures Toward Speech Driven Animation with Meaningful Behaviors","authors":"Najmeh Sadoughi, C. Busso","doi":"10.1145/2818346.2820750","DOIUrl":null,"url":null,"abstract":"Creating believable behaviors for conversational agents (CAs) is a challenging task, given the complex relationship between speech and various nonverbal behaviors. The two main approaches are rule-based systems, which tend to produce behaviors with limited variations compared to natural interactions, and data-driven systems, which tend to ignore the underlying semantic meaning of the message (e.g., gestures without meaning). We envision a hybrid system, acting as the behavior realization layer in rule-based systems, while exploiting the rich variation in natural interactions. Constrained on a given target gesture (e.g., head nod) and speech signal, the system will generate novel realizations learned from the data, capturing the timely relationship between speech and gestures. An important task in this research is identifying multiple examples of the target gestures in the corpus. This paper proposes a data mining framework for detecting gestures of interest in a motion capture database. First, we train One-class support vector machines (SVMs) to detect candidate segments conveying the target gesture. Second, we use dynamic time alignment kernel (DTAK) to compare the similarity between the examples (i.e., target gesture) and the given segments. We evaluate the approach for five prototypical hand and head gestures showing reasonable performance. These retrieved gestures are then used to train a speech-driven framework based on dynamic Bayesian networks (DBNs) to synthesize these target behaviors.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"35 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2818346.2820750","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 16

Abstract

Creating believable behaviors for conversational agents (CAs) is a challenging task, given the complex relationship between speech and various nonverbal behaviors. The two main approaches are rule-based systems, which tend to produce behaviors with limited variations compared to natural interactions, and data-driven systems, which tend to ignore the underlying semantic meaning of the message (e.g., gestures without meaning). We envision a hybrid system, acting as the behavior realization layer in rule-based systems, while exploiting the rich variation in natural interactions. Constrained on a given target gesture (e.g., head nod) and speech signal, the system will generate novel realizations learned from the data, capturing the timely relationship between speech and gestures. An important task in this research is identifying multiple examples of the target gestures in the corpus. This paper proposes a data mining framework for detecting gestures of interest in a motion capture database. First, we train One-class support vector machines (SVMs) to detect candidate segments conveying the target gesture. Second, we use dynamic time alignment kernel (DTAK) to compare the similarity between the examples (i.e., target gesture) and the given segments. We evaluate the approach for five prototypical hand and head gestures showing reasonable performance. These retrieved gestures are then used to train a speech-driven framework based on dynamic Bayesian networks (DBNs) to synthesize these target behaviors.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于有意义行为的语音驱动动画目标手势检索

鉴于言语和各种非语言行为之间的复杂关系，为会话代理(ca)创建可信的行为是一项具有挑战性的任务。两种主要的方法是基于规则的系统，它倾向于产生与自然交互相比具有有限变化的行为，以及数据驱动的系统，它倾向于忽略消息的潜在语义(例如，没有意义的手势)。我们设想一个混合系统，作为基于规则的系统中的行为实现层，同时利用自然交互中的丰富变化。在给定目标手势(例如，点头)和语音信号的约束下，系统将从数据中学习产生新的实现，捕获语音和手势之间的及时关系。本研究的一个重要任务是在语料库中识别目标手势的多个例子。本文提出了一种用于动作捕捉数据库中感兴趣手势检测的数据挖掘框架。首先，我们训练一类支持向量机(svm)来检测传递目标手势的候选片段。其次，我们使用动态时间对齐核(DTAK)来比较示例(即目标手势)与给定片段之间的相似性。我们评估了五种表现合理的手部和头部手势原型的方法。然后使用这些检索到的手势来训练基于动态贝叶斯网络(dbn)的语音驱动框架，以合成这些目标行为。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings of the 2015 ACM on International Conference on Multimodal Interaction

自引率

0.00%

发文量