A dominance estimation mechanism using eye-gaze and turn-taking information
Misato Yatsushiro, Naoya Ikeda, Yuki Hayashi, Y. Nakano
GazeIn '13. DOI: 10.1145/2535948.2535956
With the goal of contributing to multiparty conversation management, this paper proposes a mechanism for estimating conversational dominance in group interaction. Based on our corpus analysis, we previously established a regression model for dominance estimation using speech and gaze information. In this study, we implement the model as a dominance estimation mechanism and propose a way of using it to moderate multiparty conversations between a conversational robot and three human users. The system decides whom to talk to based on the dominance level of each user.
Learning aspects of interest from Gaze
Kei Shimonishi, H. Kawashima, Ryo Yonetani, Erina Ishikawa, T. Matsuyama
GazeIn '13. DOI: 10.1145/2535948.2535955
This paper presents a probabilistic framework for modeling the gaze generative process when a user browses content consisting of multiple regions. The model enables us to learn multiple aspects of interest from gaze data, to represent and estimate the user's interest as a mixture of aspects, and to predict gaze behavior in a unified framework. We recorded subjects' gaze data while they browsed a digital pictorial book and confirmed the effectiveness of the proposed model in terms of predicting the gaze target.
The acoustics of eye contact: detecting visual attention from conversational audio cues
F. Eyben, F. Weninger, L. Paletta, Björn Schuller
GazeIn '13. DOI: 10.1145/2535948.2535949
An important aspect of short dialogues is attention, as manifested by eye contact between subjects. In this study we provide a first analysis of whether such visual attention is evident in the acoustic properties of a speaker's voice. We introduce the multi-modal GRAS2 corpus, which was recorded for analysing attention in short, daily-life human-to-human interactions with strangers in public places in Graz, Austria. The corpus contains recordings of four test subjects equipped with eye-tracking glasses, three audio recording devices, and motion sensors. We describe how we robustly identify speech segments from the subjects and other people in an unsupervised manner from the multi-channel recordings. We then discuss correlations between the acoustics of the voice in these segments and the subjects' point of visual attention. A significant relation is found between the acoustic features and the distance between the subject's point of gaze and the eye region of the dialogue partner. Further, we show that automatic binary classification of eye contact vs. no eye contact from acoustic features alone is feasible, with an Unweighted Average Recall of up to 70%.
Mutual disambiguation of eye gaze and speech for sight translation and reading
Rucha Kulkarni, Kritika Jain, H. Bansal, S. Bangalore, M. Carl
GazeIn '13. DOI: 10.1145/2535948.2535953
Researchers are proposing interactive machine translation as a potential method to make the language translation process more efficient and usable. The introduction of different modalities, such as eye gaze and speech, is being explored to add to the interactivity of language translation systems. Unfortunately, the raw data provided by Automatic Speech Recognition (ASR) and eye tracking is noisy and error-prone. This paper describes a technique for reducing the errors of the two modalities, speech and eye gaze, with the help of each other in the context of sight translation and reading. Lattice representation and composition of the two modalities were used for integration. F-measure for eye gaze and Word Accuracy for ASR were used as evaluation metrics. In the reading task, we demonstrated a significant improvement in both eye-gaze F-measure and speech Word Accuracy. In the sight translation task, a significant improvement was found in gaze F-measure but not in ASR Word Accuracy.
Situated multi-modal dialog system in vehicles
Teruhisa Misu, Antoine Raux, Ian Lane, Joan Devassy, Rakesh Gupta
GazeIn '13. DOI: 10.1145/2535948.2535951
In this paper, we present Townsurfer, a situated multi-modal dialog system for vehicles. The system integrates multi-modal inputs of speech, geo-location, gaze (face direction), and dialog history to answer drivers' queries about their surroundings. To select the appropriate data source for answering a query, we apply belief tracking across the above modalities. We conducted a preliminary data collection and an evaluation focusing on the effect of gaze (head direction) and geo-location estimation. We report the results and an analysis of the collected data.
Agent-assisted multi-viewpoint video viewer and its gaze-based evaluation
Takatsugu Hirayama, Takafumi Marutani, Daishi Tanoue, Shogo Tokai, S. Fels, K. Mase
GazeIn '13. DOI: 10.1145/2535948.2535952
Humans see things from various viewpoints, but nobody attempts to see anything from every viewpoint, owing to physical restrictions and the great effort required. Intelligent interfaces for viewing multi-viewpoint videos may remove these restrictions in effective ways and direct us toward a new visual world. We propose an agent-assisted multi-viewpoint video viewer that incorporates (1) target-centered viewpoint switching and (2) social viewpoint recommendation. The viewer stabilizes an object at the center of the display field using the former function, which helps fix the user's gaze on the target object. To identify popular viewing behavior for particular content, the latter function exploits a histogram of the viewing log, in terms of time, viewpoints, and target, accumulated over many personal viewing experiences. We call this knowledge source of the director agent a viewgram. The agent automatically constructs the preferred viewpoint sequence for each target. We conducted user studies to analyze user behavior, especially eye movement, while using the viewer. Statistical analyses showed that the viewpoint sequence extracted from a viewgram provides a more distinct perspective on each target, and that target-centered viewpoint switching encourages the user to gaze at the display center, where the target is located, during viewing. The proposed viewer can thus provide more effective perspectives on the main attractions in a scene.
Finding the timings for a guide agent to intervene inter-user conversation in considering their gaze behaviors
Shochi Otogi, Hung-Hsuan Huang, R. Hotta, K. Kawagoe
GazeIn '13. DOI: 10.1145/2535948.2535957
With the advance of embodied conversational agent (ECA) technologies, there are more and more real-world deployments of ECAs, such as guides in museums or exhibitions. In those situations, however, the agent systems are usually used by groups of visitors rather than by individuals. Such multi-user situations are much more complex than single-user ones and require specific capabilities. One of them is the ability of the agent to smoothly intervene in user-user conversation, which is expected to facilitate mixed-initiative human-agent conversation and more proactive service for the users. This paper presents the results of the first step of our project, which aims to build an information-providing agent for collaborative decision-making tasks: finding the timings at which the agent can intervene in user-user conversation to provide active support, by focusing on the users' gaze. To this end, a Wizard-of-Oz (WOZ) experiment was first conducted to collect human interaction data. By analyzing the collected corpus, eight kinds of timings at which the agent could potentially intervene were identified. A method was then developed to automatically identify four of the eight kinds of timings using only nonverbal cues: gaze direction, body posture, and speech information. Although the performance of the method is moderate (F-measure 0.4), it should be possible to improve it by integrating context information in the future.
Unravelling the interaction strategies and gaze in collaborative learning with online video lectures
R. Bednarik, Marko Kauppinen
GazeIn '13. DOI: 10.1145/2535948.2535959
Using dual eye tracking, we performed a study characterising the differences in interaction patterns when learning from online materials individually or with a peer. The findings show that in the majority of cases, users prefer to use the online learning materials in parallel, each on their own tool, when working on a learning task. Collaborative learning took longer due to negotiation overheads, and most attention was paid to the materials. However, collaboration did not affect the overall distribution of gaze.
Context aware addressee estimation for human robot interaction
Samira Sheikhi, D. Jayagopi, Vasil Khalidov, J. Odobez
GazeIn '13. DOI: 10.1145/2535948.2535958
This paper investigates the problem of addressee recognition (determining to whom a speaker's utterance is intended) in a setting involving a humanoid robot interacting with multiple persons. Since it is well known that the addressee can primarily be derived from the speaker's visual focus of attention (VFOA), defined as whom or what a person is looking at, we address the following questions. How much does performance degrade when using VFOA automatically extracted from head pose instead of the VFOA ground truth? Can the conversational context improve addressee recognition, either directly as a side cue in the addressee classifier, indirectly by improving VFOA recognition, or in both ways? Finally, from a computational perspective, which VFOA features and normalizations work better, and does it matter whether the VFOA recognition module only monitors whether a person looks at potential addressee targets (the robot, people) or also considers objects of interest in the environment (paintings, in our case) as additional VFOA targets? Experiments on the public Vernissage database, in which the humanoid Nao robot gives a quiz to two participants, show that reducing VFOA confusion (either through context or by ignoring VFOA targets) improves addressee recognition.
Feature selection for gaze, pupillary, and EEG signals evoked in a 3D environment
D. Jangraw, P. Sajda
GazeIn '13. DOI: 10.1145/2535948.2535950
As we navigate our environment, we are constantly assessing the objects we encounter and deciding on their subjective interest to us. In this study, we investigate the neural and ocular correlates of this assessment as a step towards their potential use in a mobile human-computer interface (HCI). Past research has shown that multiple physiological signals are evoked by objects of interest during visual search in the laboratory, including gaze, pupil dilation, and neural activity; these have been exploited for use in various HCIs. We use a virtual environment to explore which of these signals are also evoked during exploration of a dynamic, free-viewing 3D environment. Using a hierarchical classifier and sequential forward floating selection (SFFS), we identify a small, robust set of features across multiple modalities that can be used to distinguish targets from distractors in the virtual environment. The identification of these features may serve as an important factor in the design of mobile HCIs.