Title: Gaze and filled pause detection for smooth human-robot conversations
Authors: Miriam Bilac, Marine Chamoux, Angelica Lim
Published in: 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids)
Publication date: 2017-11-01
DOI: 10.1109/HUMANOIDS.2017.8246889 (https://doi.org/10.1109/HUMANOIDS.2017.8246889)
Citations: 9
Abstract
Let the human speak! Interactive robots and voice interfaces such as Pepper, Amazon Alexa, and OK Google are becoming more and more popular, allowing more natural interaction than screens or keyboards. One issue with voice interfaces is that they tend to require a “robotic” flow of human speech. Humans must be careful not to produce disfluencies, such as hesitations or extended pauses between words. If they do, the agent may assume that the human has finished their speaking turn and interrupt them mid-thought. Interactive robots often rely on the same limited dialogue technology built for speech interfaces. Yet humanoid robots can also use their vision systems to determine when the human has finished their speaking turn. In this paper, we introduce HOMAGE (Human-rObot Multimodal Audio and Gaze End-of-turn), a multimodal turn-taking system for conversational humanoid robots. We created a dataset of humans spontaneously hesitating when responding to a robot's open-ended questions, such as “What was your favorite moment this year?” Our analyses found that users both produced auditory filled pauses such as “uhhh” and gazed away from the robot to keep their speaking turn. We then trained a machine learning system to detect the auditory filled pauses and integrated it, along with gaze, into the Pepper humanoid robot's real-time dialogue system. Experiments with 28 naive users revealed that adding auditory filled pause detection and gaze tracking significantly reduced robot interruptions. Furthermore, user turns were 2.1 times longer (without repetitions), suggesting that this strategy allows humans to express themselves more, moving toward less time pressure and better robot listeners.
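The turn-holding cues named in the abstract (filled pauses and gaze away from the robot) can be illustrated with a small decision rule. The sketch below is not the HOMAGE implementation: the `Frame` fields, the `end_of_turn_index` function, the frame-based silence threshold, and the rule itself are all assumptions made for illustration. It only shows how a silence-only end-of-turn detector might be extended so that a detected filled pause or averted gaze resets the silence counter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """One time step of hypothetical perception output (not from the paper)."""
    speech: bool         # voice activity detected in this frame
    filled_pause: bool   # an "uhhh"-style hesitation detected in this frame
    gaze_on_robot: bool  # the user is looking at the robot in this frame


def end_of_turn_index(
    frames: List[Frame],
    silence_frames: int = 3,   # assumed threshold, chosen only for the example
    multimodal: bool = True,
) -> Optional[int]:
    """Return the index of the frame where the robot would take the turn.

    Silence-only rule (multimodal=False): the turn ends after
    `silence_frames` consecutive frames without speech.

    Assumed multimodal rule (multimodal=True): a filled pause or gaze away
    from the robot also counts as the user holding the turn, so either cue
    resets the silence counter.
    """
    quiet = 0
    for i, frame in enumerate(frames):
        holding = frame.speech
        if multimodal:
            holding = holding or frame.filled_pause or not frame.gaze_on_robot
        if holding:
            quiet = 0
        else:
            quiet += 1
            if quiet >= silence_frames:
                return i
    return None  # the user never released the turn in this sequence


# Example: the user speaks, pauses while looking away, then looks back.
example = (
    [Frame(speech=True, filled_pause=False, gaze_on_robot=True)]
    + [Frame(speech=False, filled_pause=False, gaze_on_robot=False)] * 3
    + [Frame(speech=False, filled_pause=False, gaze_on_robot=True)] * 3
)
silence_only = end_of_turn_index(example, multimodal=False)
with_cues = end_of_turn_index(example, multimodal=True)
```

In this toy sequence the silence-only rule takes the turn while the user is still looking away, whereas the multimodal rule waits until the user has looked back at the robot and stayed silent for the full threshold.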