R. Bixler, Nathaniel Blanchard, L. Garrison, S. D’Mello
Mind wandering (MW) entails an involuntary shift in attention from task-related thoughts to task-unrelated thoughts, and has been shown to have detrimental effects on performance in a number of contexts. This paper proposes an automated multimodal detector of MW using eye gaze, physiology (skin conductance and skin temperature), and aspects of the context (e.g., time on task, task difficulty). Eye gaze and physiological signals were collected as 178 participants read four instructional texts from a computer interface. Participants periodically provided self-reports of MW in response to pseudorandom auditory probes during reading. Supervised machine learning models trained on features extracted from participants' gaze fixations, physiological signals, and contextual cues were used to detect pages where participants reported MW in response to the auditory probes. Two methods of combining gaze and physiology features were explored. Feature-level fusion entailed building a single model from the combined feature vectors of the individual modalities. Decision-level fusion entailed building individual models for each modality and adjudicating among their decisions. Feature-level fusion resulted in an 11% improvement in classification accuracy over the best unimodal model, reflected in small improvements in both precision and recall, whereas decision-level fusion yielded no comparable improvement. An analysis of the features indicated that MW was associated with fewer and longer fixations and saccades, and a higher, more deterministic skin temperature. Possible applications of the detector are discussed.
{"title":"Automatic Detection of Mind Wandering During Reading Using Gaze and Physiology","authors":"R. Bixler, Nathaniel Blanchard, L. Garrison, S. D’Mello","doi":"10.1145/2818346.2820742","DOIUrl":"https://doi.org/10.1145/2818346.2820742","url":null,"abstract":"Mind wandering (MW) entails an involuntary shift in attention from task-related thoughts to task-unrelated thoughts, and has been shown to have detrimental effects on performance in a number of contexts. This paper proposes an automated multimodal detector of MW using eye gaze and physiology (skin conductance and skin temperature) and aspects of the context (e.g., time on task, task difficulty). Data in the form of eye gaze and physiological signals were collected as 178 participants read four instructional texts from a computer interface. Participants periodically provided self-reports of MW in response to pseudorandom auditory probes during reading. Supervised machine learning models trained on features extracted from participants' gaze fixations, physiological signals, and contextual cues were used to detect pages where participants provided positive responses of MW to the auditory probes. Two methods of combining gaze and physiology features were explored. Feature level fusion entailed building a single model by combining feature vectors from individual modalities. Decision level fusion entailed building individual models for each modality and adjudicating amongst individual decisions. Feature level fusion resulted in an 11% improvement in classification accuracy over the best unimodal model, but there was no comparable improvement for decision level fusion. This was reflected by a small improvement in both precision and recall. An analysis of the features indicated that MW was associated with fewer and longer fixations and saccades, and a higher more deterministic skin temperature. Possible applications of the detector are discussed.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76362634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Affect detection is an important component of computerized learning environments that adapt the interface and materials to students' affect. This paper proposes a plan for developing and testing multimodal affect detectors that generalize across differences in data that are likely to occur in practical applications (e.g., time, demographic variables). Facial features and interaction log features are considered as modalities for affect detection in this scenario, each with its own advantages. Results are presented for completed work evaluating the accuracy of individual face-based and interaction-based detectors, the accuracy and availability of a multimodal combination of these modalities, and initial steps toward generalization of face-based detectors. Additional data collection needed for cross-cultural generalization testing has also been completed. Challenges and possible solutions for the proposed cross-cultural generalization testing of multimodal detectors are also discussed.
{"title":"Multimodal Affect Detection in the Wild: Accuracy, Availability, and Generalizability","authors":"Nigel Bosch","doi":"10.1145/2818346.2823316","DOIUrl":"https://doi.org/10.1145/2818346.2823316","url":null,"abstract":"Affect detection is an important component of computerized learning environments that adapt the interface and materials to students' affect. This paper proposes a plan for developing and testing multimodal affect detectors that generalize across differences in data that are likely to occur in practical applications (e.g., time, demographic variables). Facial features and interaction log features are considered as modalities for affect detection in this scenario, each with their own advantages. Results are presented for completed work evaluating the accuracy of individual modality face- and interaction- based detectors, accuracy and availability of a multimodal combination of these modalities, and initial steps toward generalization of face-based detectors. Additional data collection needed for cross-culture generalization testing is also completed. Challenges and possible solutions for proposed cross-cultural generalization testing of multimodal detectors are also discussed.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80808640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Punarjay Chakravarty, S. Mirzaei, T. Tuytelaars, H. V. hamme
Active speakers have traditionally been identified in video by detecting their moving lips. This paper demonstrates the same task using spatio-temporal features that aim to capture other cues: movement of the head, upper body, and hands of active speakers. Speaker directional information, obtained using sound source localization from a microphone array, is used to supervise the training of these video features.
{"title":"Who's Speaking?: Audio-Supervised Classification of Active Speakers in Video","authors":"Punarjay Chakravarty, S. Mirzaei, T. Tuytelaars, H. V. hamme","doi":"10.1145/2818346.2820780","DOIUrl":"https://doi.org/10.1145/2818346.2820780","url":null,"abstract":"Active speakers have traditionally been identified in video by detecting their moving lips. This paper demonstrates the same using spatio-temporal features that aim to capture other cues: movement of the head, upper body and hands of active speakers. Speaker directional information, obtained using sound source localization from a microphone array is used to supervise the training of these video features.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85668989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yun-Nung (Vivian) Chen, Ming Sun, Alexander I. Rudnicky, A. Gershman
Spoken language interfaces are appearing in various smart devices (e.g., smartphones, smart TVs, in-car navigation systems) and serve as intelligent assistants (IAs). However, most of them do not consider individual users' behavioral profiles and contexts when modeling user intents. Such behavioral patterns are user-specific and provide useful cues to improve spoken language understanding (SLU). This paper focuses on leveraging app behavior history to improve spoken dialog system performance. We developed a matrix factorization approach that models speech and app usage patterns to predict user intents (e.g., launching a specific app). We collected multi-turn interactions in a WoZ scenario; users were asked to reproduce multi-app tasks that they had performed earlier on their smartphones. By modeling latent semantics behind lexical and behavioral patterns, the proposed multi-model system achieves about 52% turn accuracy for intent prediction on ASR transcripts.
{"title":"Leveraging Behavioral Patterns of Mobile Applications for Personalized Spoken Language Understanding","authors":"Yun-Nung (Vivian) Chen, Ming Sun, Alexander I. Rudnicky, A. Gershman","doi":"10.1145/2818346.2820781","DOIUrl":"https://doi.org/10.1145/2818346.2820781","url":null,"abstract":"Spoken language interfaces are appearing in various smart devices (e.g. smart-phones, smart-TV, in-car navigating systems) and serve as intelligent assistants (IAs). However, most of them do not consider individual users' behavioral profiles and contexts when modeling user intents. Such behavioral patterns are user-specific and provide useful cues to improve spoken language understanding (SLU). This paper focuses on leveraging the app behavior history to improve spoken dialog systems performance. We developed a matrix factorization approach that models speech and app usage patterns to predict user intents (e.g. launching a specific app). We collected multi-turn interactions in a WoZ scenario; users were asked to reproduce the multi-app tasks that they had performed earlier on their smart-phones. By modeling latent semantics behind lexical and behavioral patterns, the proposed multi-model system achieves about 52% of turn accuracy for intent prediction on ASR transcripts.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84886857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Kahou, Vincent Michalski, K. Konda, R. Memisevic, C. Pal
Deep learning based approaches to facial analysis and video analysis have recently demonstrated high performance on a variety of key tasks such as face recognition, emotion recognition and activity recognition. In the case of video, information often must be aggregated across a variable length sequence of frames to produce a classification result. Prior work using convolutional neural networks (CNNs) for emotion recognition in video has relied on temporal averaging and pooling operations reminiscent of widely used approaches for the spatial aggregation of information. Recurrent neural networks (RNNs) have seen an explosion of recent interest as they yield state-of-the-art performance on a variety of sequence analysis tasks. RNNs provide an attractive framework for propagating information over a sequence using a continuous valued hidden layer representation. In this work we present a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge. We focus our presentation and experimental analysis on a hybrid CNN-RNN architecture for facial expression analysis that can outperform a previously applied CNN approach using temporal averaging for aggregation.
{"title":"Recurrent Neural Networks for Emotion Recognition in Video","authors":"S. Kahou, Vincent Michalski, K. Konda, R. Memisevic, C. Pal","doi":"10.1145/2818346.2830596","DOIUrl":"https://doi.org/10.1145/2818346.2830596","url":null,"abstract":"Deep learning based approaches to facial analysis and video analysis have recently demonstrated high performance on a variety of key tasks such as face recognition, emotion recognition and activity recognition. In the case of video, information often must be aggregated across a variable length sequence of frames to produce a classification result. Prior work using convolutional neural networks (CNNs) for emotion recognition in video has relied on temporal averaging and pooling operations reminiscent of widely used approaches for the spatial aggregation of information. Recurrent neural networks (RNNs) have seen an explosion of recent interest as they yield state-of-the-art performance on a variety of sequence analysis tasks. RNNs provide an attractive framework for propagating information over a sequence using a continuous valued hidden layer representation. In this work we present a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge. We focus our presentation and experimental analysis on a hybrid CNN-RNN architecture for facial expression analysis that can outperform a previously applied CNN approach using temporal averaging for aggregation.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74090351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We explore using the Outer Ear Interface (OEI) to recognize eating activities. OEI contains a 3D gyroscope and a set of proximity sensors encapsulated in an off-the-shelf earpiece to monitor jaw movement by measuring ear canal deformation. In a laboratory setting with 20 participants, OEI could distinguish eating from other activities, such as walking, talking, and silently reading, with over 90% accuracy (user independent). In a second study, six subjects wore the system for six hours each while performing their normal daily activities. OEI correctly classified five-minute segments of time as eating or non-eating with 93% accuracy (user dependent).
{"title":"Detecting Mastication: A Wearable Approach","authors":"Abdelkareem Bedri, Apoorva Verlekar, Edison Thomaz, Valerie Avva, Thad Starner","doi":"10.1145/2818346.2820767","DOIUrl":"https://doi.org/10.1145/2818346.2820767","url":null,"abstract":"We explore using the Outer Ear Interface (OEI) to recognize eating activities. OEI contains a 3D gyroscope and a set of proximity sensors encapsulated in an off-the-shelf earpiece to monitor jaw movement by measuring ear canal deformation. In a laboratory setting with 20 participants, OEI could distinguish eating from other activities, such as walking, talking, and silently reading, with over 90% accuracy (user independent). In a second study, six subjects wore the system for 6 hours each while performing their normal daily activities. OEI correctly classified five minute segments of time as eating or non-eating with 93% accuracy (user dependent).","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"57 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82304242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the main goals in Human-Computer Interaction (HCI) is improving the interface between users and computers: interfacing should be intuitive, effortless, and easy to learn. We approach this goal from two opposite but complementary directions. On the one hand, computer-user interaction can be enhanced if the computer can assess differences between users in an automated manner. To this end, we collected physiological and psychological data from people exposed to emotional stimuli and created a database for the community to use in further research on automatically detecting differences in users' inner states. We used the data to predict not only users' emotional states but also their personality traits. On the other hand, information dispatched by a computer needs to be easily and intuitively accessible to users. To minimize the cognitive effort of assimilating information, we use a tactile device in the form of a belt and test how it can best be used to replace or augment information received through other senses (e.g., vision and hearing) in a navigation task. We also investigate how both approaches can be combined to improve specific applications.
{"title":"Implicit Human-computer Interaction: Two Complementary Approaches","authors":"Julia Wache","doi":"10.1145/2818346.2823311","DOIUrl":"https://doi.org/10.1145/2818346.2823311","url":null,"abstract":"One of the main goals in Human Computer Interaction (HCI) is improving the interface between users and computers: Interfacing should be intuitive, effortless and easy to learn. We approach the goal from two opposite but complementary directions: On the one hand, computer-user interaction can be enhanced if the computer can assess users differences in an automated manner. Therefore we collected physiological and psychological data from people exposed to emotional stimuli and created a database for the community to use for further research in the context of automated learning to detect the differences in the inner states of users. We employed the data both to not only predict the emotional state of users but also their personality traits. On the other hand, users need information dispatched by a computer to be easily, intuitively accessible. To minimize the cognitive effort of assimilating information we use a tactile device in form of a belt and test how it can be best used to replace or augment the information received from other senses (e.g., visual and auditory) in a navigation task. We investigate how both approaches can be combined to improve specific applications.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78006128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sarah Strohkorb, Iolanda Leite, Natalie Warren, B. Scassellati
As social robots become more widespread in educational environments, their ability to understand group dynamics and engage multiple children in social interactions is crucial. Social dominance is a highly influential factor in social interactions, expressed through both verbal and nonverbal behaviors. In this paper, we present a method for determining whether a participant is high or low in social dominance in a group interaction with children and robots. We investigated the correlations of many verbal and nonverbal behavioral features with social dominance levels collected through teacher surveys. We additionally implemented logistic regression and support vector machine models with classification accuracies of 81% and 89%, respectively, showing that these models can successfully classify children's social dominance level using a small subset of nonverbal behavioral features. Our approach to classifying social dominance is novel not only for its application to children, but also for achieving high classification accuracy using a reduced set of nonverbal features that, in future work, can be automatically extracted with current sensing technology.
{"title":"Classification of Children's Social Dominance in Group Interactions with Robots","authors":"Sarah Strohkorb, Iolanda Leite, Natalie Warren, B. Scassellati","doi":"10.1145/2818346.2820735","DOIUrl":"https://doi.org/10.1145/2818346.2820735","url":null,"abstract":"As social robots become more widespread in educational environments, their ability to understand group dynamics and engage multiple children in social interactions is crucial. Social dominance is a highly influential factor in social interactions, expressed through both verbal and nonverbal behaviors. In this paper, we present a method for determining whether a participant is high or low in social dominance in a group interaction with children and robots. We investigated the correlation between many verbal and nonverbal behavioral features with social dominance levels collected through teacher surveys. We additionally implemented Logistic Regression and Support Vector Machines models with classification accuracies of 81% and 89%, respectively, showing that using a small subset of nonverbal behavioral features, these models can successfully classify children's social dominance level. Our approach for classifying social dominance is novel not only for its application to children, but also for achieving high classification accuracies using a reduced set of nonverbal features that, in future work, can be automatically extracted with current sensing technology.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"32 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91488971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Geographic Information Systems (GIS) offer a large number of functions for performing spatial analysis and geospatial information retrieval. However, off-the-shelf GIS remains difficult to use for occasional GIS experts. The major problem is that its interface organizes spatial analysis tools and functions according to spatial data structures and the corresponding algorithms, which is conceptually confusing and cognitively complex. Prior work identified the usability problems of conventional GIS interfaces and developed alternatives based on speech or gesture to narrow the gap between the high functionality provided by GIS and its usability. This paper outlines my doctoral research goal of understanding human-GIS interaction, especially how interaction modalities help capture spatial analysis intentions and influence collaborative spatial problem solving. We propose a framework for enabling intention-driven multimodal human-GIS interaction. We have also implemented a prototype, GeoEASI (Geo-dialogue Environment for Assisted Spatial Inquiry), to demonstrate the effectiveness of our framework. GeoEASI understands commonly known spatial analysis intentions through multimodal techniques and can assist users in performing spatial analysis with proper strategies. Further work will evaluate the effectiveness of the framework, improve the reliability and flexibility of the system, extend the GIS interface to support multiple users, and integrate the system into GeoDeliberation. We will concentrate on how multimodal technology can be adopted in these circumstances and explore its potential. The study aims to demonstrate the feasibility of building a GIS that is both useful and usable by introducing an intent-driven multimodal interface, forming the key to building a better theory of spatial thinking for GIS.
{"title":"Exploring Intent-driven Multimodal Interface for Geographical Information System","authors":"Feng Sun","doi":"10.1145/2818346.2823304","DOIUrl":"https://doi.org/10.1145/2818346.2823304","url":null,"abstract":"Geographic Information Systems (GIS) offers a large amount of functions for performing spatial analysis and geospatial information retrieval. However, off-the-shelf GIS remains difficult to use for occasional GIS experts. The major problem lies in that its interface organizes spatial analysis tools and functions according to spatial data structures and corresponding algorithms, which is conceptually confusing and cognitively complex. Prior work identified the usability problem of conventional GIS interface and developed alternatives based on speech or gesture to narrow the gap between the high-functionality provided by GIS and its usability. This paper outlined my doctoral research goal in understanding human-GIS interaction activity, especially how interaction modalities assist to capture spatial analysis intention and influence collaborative spatial problem solving. We proposed a framework for enabling multimodal human-GIS interaction driven by intention. We also implemented a prototype GeoEASI (Geo-dialogue Environment for Assisted Spatial Inquiry) to demonstrate the effectiveness of our framework. GeoEASI understands commonly known spatial analysis intentions through multimodal techniques and is able to assist users to perform spatial analysis with proper strategies. Further work will evaluate the effectiveness of our framework, improve the reliability and flexibility of the system, extend the GIS interface for supporting multiple users, and integrate the system into GeoDeliberation. We will concentrate on how multimodality technology can be adopted in these circumstances and explore the potentials of it. The study aims to demonstrate the feasibility of building a GIS to be both useful and usable by introducing an intent-driven multimodal interface, forming the key to building a better theory of spatial thinking for GIS.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86350987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To become capable teammates to people, robots need the ability to interpret human activities and appropriately adjust their actions in real time. The goal of our research is to build robots that can work fluently and contingently with human teams. To this end, we have designed novel nonlinear dynamical methods to automatically model and detect synchronous joint action (SJA) in human teams. We also have extended this work to enable robots to move jointly with human teammates in real time. In this paper, we describe our work to date, and discuss our future research plans to further explore this research space. The results of this work are expected to benefit researchers in social signal processing, human-machine interaction, and robotics.
{"title":"Detecting and Synthesizing Synchronous Joint Action in Human-Robot Teams","authors":"T. Iqbal, L. Riek","doi":"10.1145/2818346.2823315","DOIUrl":"https://doi.org/10.1145/2818346.2823315","url":null,"abstract":"To become capable teammates to people, robots need the ability to interpret human activities and appropriately adjust their actions in real time. The goal of our research is to build robots that can work fluently and contingently with human teams. To this end, we have designed novel nonlinear dynamical methods to automatically model and detect synchronous joint action (SJA) in human teams. We also have extended this work to enable robots to move jointly with human teammates in real time. In this paper, we describe our work to date, and discuss our future research plans to further explore this research space. The results of this work are expected to benefit researchers in social signal processing, human-machine interaction, and robotics.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"272 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79645055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}