Recently, a number of TV manufacturers have introduced TV remotes with a touchpad used for indirect control of the TV UI. Users can navigate the UI by moving a finger across the touchpad. However, due to the latency in visual feedback, there is a disconnect between the finger movement on the touchpad and the visual perception in the TV UI, which often causes overshooting. In this paper, we investigate how haptic feedback affects the user experience of a touchpad-based TV remote. We describe two haptic prototypes, built on a smartphone and on the Samsung 2013 TV remote, respectively. We conducted two user studies with the two prototypes to evaluate how user preference and user performance are affected. The results show overwhelming support for haptic feedback in terms of subjective user preference, though we did not find a significant difference in performance between the conditions with and without haptic feedback.
{"title":"Active Haptic Feedback for Touch Enabled TV Remote","authors":"Anton Treskunov, Mike Darnell, Rongrong Wang","doi":"10.1145/2818346.2820768","DOIUrl":"https://doi.org/10.1145/2818346.2820768","url":null,"abstract":"Recently a number of TV manufacturers introduced TV remotes with a touchpad which is used for indirect control of TV UI. Users can navigate the UI by moving a finger across the touch pad. However, due to the latency in visual feedback, there is a disconnection between the finger movement on the touchpad and the visual perception in the TV UI, which often causes overshooting. In this paper, we investigate how haptic feedback affects the user experience of the touchpad-based TV remote. We described two haptic prototypes built on the smartphone and Samsung 2013 TV remote respectively. We conducted two user studies with two prototypes to evaluate how the user preference and the user performance been affected. The results show that there is overwhelming support of haptic feedback in terms of subjective user preference, though we didn't find significant difference in performance between with and without haptic feedback conditions.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86952788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alina Roitberg, N. Somani, A. Perzylo, Markus Rickert, A. Knoll
We present an approach for monitoring and interpreting human activities based on a novel multimodal vision-based interface, aiming to improve the efficiency of human-robot interaction (HRI) in industrial environments. Multimodality is an important aspect of this design: we combine inputs from several state-of-the-art sensors to provide a variety of information, e.g., skeleton and fingertip poses. Based on typical industrial workflows, we derived multiple levels of human activity labels, including large-scale activities (e.g., assembly) and simpler sub-activities (e.g., hand gestures), creating a duration- and complexity-based hierarchy. We train supervised generative classifiers for each activity level and combine the output of this stage with a trained Hierarchical Hidden Markov Model (HHMM), which models not only the temporal relations between activities on the same level, but also the hierarchical relationships between the levels.
{"title":"Multimodal Human Activity Recognition for Industrial Manufacturing Processes in Robotic Workcells","authors":"Alina Roitberg, N. Somani, A. Perzylo, Markus Rickert, A. Knoll","doi":"10.1145/2818346.2820738","DOIUrl":"https://doi.org/10.1145/2818346.2820738","url":null,"abstract":"We present an approach for monitoring and interpreting human activities based on a novel multimodal vision-based interface, aiming at improving the efficiency of human-robot interaction (HRI) in industrial environments. Multi-modality is an important concept in this design, where we combine inputs from several state-of-the-art sensors to provide a variety of information, e.g. skeleton and fingertip poses. Based on typical industrial workflows, we derived multiple levels of human activity labels, including large-scale activities (e.g. assembly) and simpler sub-activities (e.g. hand gestures), creating a duration- and complexity-based hierarchy. We train supervised generative classifiers for each activity level and combine the output of this stage with a trained Hierarchical Hidden Markov Model (HHMM), which models not only the temporal aspects between the activities on the same level, but also the hierarchical relationships between the levels.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88563683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present the first step of a methodology for automatically deducing the sequences of signals expressed by humans during an interaction. The aim is to link interpersonal stances with arrangements of social signals, such as modulations of Action Units and prosody, during a face-to-face exchange. The long-term goal is to infer association rules of signals, which we plan to use as input to the animation of an Embodied Conversational Agent (ECA). In this paper, we illustrate the proposed methodology on the SEMAINE-DB corpus, from which we automatically extracted Action Units (AUs), head positions, turn-taking, and prosody information. We applied a data mining algorithm to find the sequences of social signals that characterize different social stances. We finally discuss our preliminary results, focusing on particular AUs (smiles and eyebrows), and the perspectives of this method.
{"title":"Temporal Association Rules for Modelling Multimodal Social Signals","authors":"Thomas Janssoone","doi":"10.1145/2818346.2823305","DOIUrl":"https://doi.org/10.1145/2818346.2823305","url":null,"abstract":"In this paper, we present the first step of a methodology dedicated to deduce automatically sequences of signals expressed by humans during an interaction. The aim is to link interpersonal stances with arrangements of social signals such as modulations of Action Units and prosody during a face-to-face exchange. The long-term goal is to infer association rules of signals. We plan to use them as an input to the animation of an Embodied Conversational Agent (ECA). In this paper, we illustrate the proposed methodology to the SEMAINE-DB corpus from which we automatically extracted Action Units (AUs), head positions, turn-taking and prosody information. We have applied the data mining algorithm that is used to find the sequences of social signals featuring different social stances. We finally discuss our primary results focusing on given AUs (smiles and eyebrows) and the perspectives of this method.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85943784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interact is a mobile virtual assistant that uses multimodal dialog to enable an interactive concierge experience over multiple application domains, including hotel, restaurant, event, and TV search. Interact demonstrates how multimodal interaction combined with conversational dialog enables a richer and more natural user experience. This demonstration will highlight incremental recognition and understanding, multimodal speech and gesture input, context tracking over multiple simultaneous domains, and the use of multimodal interface techniques to enable disambiguation of errors and online personalization.
{"title":"Interact: Tightly-coupling Multimodal Dialog with an Interactive Virtual Assistant","authors":"Ethan Selfridge, Michael Johnston","doi":"10.1145/2818346.2823301","DOIUrl":"https://doi.org/10.1145/2818346.2823301","url":null,"abstract":"Interact is a mobile virtual assistant that uses multimodal dialog to enable an interactive concierge experience over multiple application domains including hotel, restaurants, events, and TV search. Interact demonstrates how multi- modal interaction combined with conversational dialog en- ables a richer and more natural user experience. This demonstration will highlight incremental recognition and under- standing, multimodal speech and gesture input, context track- ing over multiple simultaneous domains, and the use of multimodal interface techniques to enable disambiguation of erors and online personalization.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80453205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present the methods used for the Bahcesehir University team's submissions to the 2015 Emotion Recognition in the Wild Challenge. The challenge consists of categorical emotion recognition in short video clips extracted from movies based on emotional keywords in the subtitles. The video clips mostly contain expressive faces (single or multiple) as well as audio that includes the speech of the person in the clip along with other human voices or background sounds/music. We use an audio-visual method based on video summarization by key frame selection. The key frame selection uses a minimum sparse reconstruction approach with the goal of representing the original video as well as possible. We extract the LPQ features of the key frames and average them to obtain a single feature vector that represents the video component of the clip. In order to represent the temporal variations of the facial expression, we also use LBP-TOP features extracted from the whole video. The audio features are extracted using the openSMILE or RASTA-PLP methods. Video and audio features are classified using SVM classifiers and fused at the score level. We tested eight different combinations of audio and visual features on the AFEW 5.0 (Acted Facial Expressions in the Wild) database provided by the challenge organizers. The best visual and audio-visual accuracies obtained on the test set are 45.1% and 49.9%, respectively, whereas the video-based baseline for the challenge is given as 39.3%.
{"title":"Affect Recognition using Key Frame Selection based on Minimum Sparse Reconstruction","authors":"M. Kayaoglu, Ç. Erdem","doi":"10.1145/2818346.2830594","DOIUrl":"https://doi.org/10.1145/2818346.2830594","url":null,"abstract":"In this paper, we present the methods used for Bahcesehir University team's submissions to the 2015 Emotion Recognition in the Wild Challenge. The challenge consists of categorical emotion recognition in short video clips extracted from movies based on emotional keywords in the subtitles. The video clips mostly contain expressive faces (single or multiple) and also audio which contains the speech of the person in the clip as well as other human voices or background sounds/music. We use an audio-visual method based on video summarization by key frame selection. The key frame selection uses a minimum sparse reconstruction approach with the goal of representing the original video in the best possible way. We extract the LPQ features of the key frames and average them to determine a single feature vector that will represent the video component of the clip. In order to represent the temporal variations of the facial expression, we also use the LBP-TOP features extracted from the whole video. The audio features are extracted using OpenSMILE or RASTA-PLP methods. Video and audio features are classified using SVM classifiers and fused at the score level. We tested eight different combinations of audio and visual features on the AFEW 5.0 (Acted Facial Expressions in the Wild) database provided by the challenge organizers. The best visual and audio-visual accuracies obtained on the test set are 45.1% and 49.9% respectively, whereas the video-based baseline for the challenge is given as 39.3%.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89616053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Elderly people often need support in everyday situations -- e.g., common daily life activities like taking care of house and garden, or caring for an animal, are often not possible without a larger support circle. However, especially in larger western cities, local social networks may not be very tight, friends may have moved away or died, and the traditional support structures found in so-called multi-generational families no longer exist. As a result, the quality of life of elderly people suffers considerably. On the other hand, people from the broader neighborhood would often gladly help and respond quickly. With the project Wir im Kiez we developed and tested a multimodal social network app equipped with a conversational interface that addresses these issues. In the demonstration, we especially focus on the needs and constraints of seniors, with respect to both their physical and psychological limitations.
{"title":"Wir im Kiez: Multimodal App for Mutual Help Among Elderly Neighbours","authors":"S. Schmeier, Aaron Ruß, Norbert Reithinger","doi":"10.1145/2818346.2823300","DOIUrl":"https://doi.org/10.1145/2818346.2823300","url":null,"abstract":"Elderly people often need support in everyday situations -- e.g. common daily life activities like taking care of house and garden, or caring for an animal are often not possible without a larger support circle. However, especially in larger western cities, local social networks may not be very tight, friends may have moved away or died, and the traditional support structures found in so-called multi-generational families do not exist anymore. As a result, the quality of life for elderly people suffers crucially. On the other hand, people from the broader neighborhood would often gladly help and respond quickly. With the project Wir im Kiez we developed and tested a multimodal social network app equipped with a conversational interface that addresses these issues. In the demonstration, we especially focus on the needs and restrictions of seniors, both in their physical and psychological limitations.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80955341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Julia Wache, Subramanian Ramanathan, M. K. Abadi, R. Vieriu, N. Sebe, Stefan Winkler
We present a novel framework for recognizing personality traits based on users' physiological responses to affective movie clips. Extending studies that have correlated explicit/implicit affective user responses with the Extraversion and Neuroticism traits, we perform single-trial recognition of the big-five traits from Electrocardiogram (ECG), Galvanic Skin Response (GSR), Electroencephalogram (EEG) and facial emotional responses compiled from 36 users with off-the-shelf sensors. Firstly, we examine relationships among personality scales and (explicit) affective user ratings acquired in the context of prior observations. Secondly, we isolate physiological correlates of personality traits. Finally, unimodal and multimodal personality recognition results are presented. Personality differences are better revealed when analyzing responses to emotionally homogeneous (e.g., high valence, high arousal) clips, and significantly above-chance recognition is achieved for all five traits.
{"title":"Implicit User-centric Personality Recognition Based on Physiological Responses to Emotional Videos","authors":"Julia Wache, Subramanian Ramanathan, M. K. Abadi, R. Vieriu, N. Sebe, Stefan Winkler","doi":"10.1145/2818346.2820736","DOIUrl":"https://doi.org/10.1145/2818346.2820736","url":null,"abstract":"We present a novel framework for recognizing personality traits based on users' physiological responses to affective movie clips. Extending studies that have correlated explicit/implicit affective user responses with Extraversion and Neuroticism traits, we perform single-trial recognition of the big-five traits from Electrocardiogram (ECG), Galvanic Skin Response (GSR), Electroencephalogram (EEG) and facial emotional responses compiled from 36 users using off-the-shelf sensors. Firstly, we examine relationships among personality scales and (explicit) affective user ratings acquired in the context of prior observations. Secondly, we isolate physiological correlates of personality traits. Finally, unimodal and multimodal personality recognition results are presented. Personality differences are better revealed while analyzing responses to emotionally homogeneous (e.g., high valence, high arousal) clips, and significantly above-chance recognition is achieved for all five traits.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"95 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89095858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote Address 2","authors":"P. Cohen","doi":"10.1145/3252444","DOIUrl":"https://doi.org/10.1145/3252444","url":null,"abstract":"","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81073990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a speech synthesis method for people with articulation disorders. Because the movements of such speakers are limited by their athetoid symptoms, their prosody is often unstable and their speech rate differs from that of a physically unimpaired person, which makes their speech less intelligible and, consequently, makes communication with physically unimpaired persons difficult. In order to deal with these problems, this paper describes a Hidden Markov Model (HMM)-based text-to-speech synthesis approach that preserves the individuality of a person with an articulation disorder and aids them in their communication. In our method, a duration model of a physically unimpaired person is used for the HMM synthesis system, and the F0 model in the system is trained using the F0 patterns of the physically unimpaired person, with the average F0 converted to the target F0 in advance. In order to preserve the target speaker's individuality, a spectral model is built from the target's spectra. Through experimental evaluations, we have confirmed that the proposed method successfully synthesizes intelligible speech while maintaining the target speaker's individuality.
{"title":"Individuality-Preserving Voice Reconstruction for Articulation Disorders Using Text-to-Speech Synthesis","authors":"Reina Ueda, T. Takiguchi, Y. Ariki","doi":"10.1145/2818346.2820770","DOIUrl":"https://doi.org/10.1145/2818346.2820770","url":null,"abstract":"This paper presents a speech synthesis method for people with articulation disorders. Because the movements of such speakers are limited by their athetoid symptoms, their prosody is often unstable and their speech rate differs from that of a physically unimpaired person, which causes their speech to be less intelligible and, consequently, makes communication with physically unimpaired persons difficult. In order to deal with these problems, this paper describes a Hidden Markov Model(HMM)-based text-to-speech synthesis approach that preserves the individuality of a person with an articulation disorder and aids them in their communication. In our method, a duration model of a physically unimpaired person is used for the HMM synthesis system and an F0 model in the system is trained using the F0 patterns of the physically unimpaired person, with the average F0 being converted to the target F0 in advance. In order to preserve the target speaker's individuality, a spectral model is built from target spectra. Through experimental evaluations, we have confirmed that the proposed method successfully synthesizes intelligible speech while maintaining the target speaker's individuality.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80138812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Oral Session 4: Communication Dynamics","authors":"Louis-Philippe Morency","doi":"10.1145/3252449","DOIUrl":"https://doi.org/10.1145/3252449","url":null,"abstract":"","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78604567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}