Human communication relies on multiple modalities such as verbal expressions, facial cues, and bodily gestures. Developing computational approaches to process and generate these multimodal signals is critical for seamless human-agent interaction. A particular challenge is the generation of co-speech gestures due to the large variability and number of gestures that can accompany a verbal utterance, leading to a one-to-many mapping problem. This paper presents an approach based on a Feature Extraction Infusion Network (FEIN-Z) that adopts insights from robot imitation learning and applies them to co-speech gesture generation. Building on the BC-Z architecture, our framework combines transformer architectures and Wasserstein generative adversarial networks. We describe the FEIN-Z methodology and evaluation results obtained within the GENEA Challenge 2023, demonstrating good results and significant improvements in human-likeness over the GENEA baseline. We discuss potential areas for improvement, such as refining input segmentation, employing more fine-grained control networks, and exploring alternative inference methods.
{"title":"FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation","authors":"Leon Harz, Hendric Voß, Stefan Kopp","doi":"10.1145/3577190.3616115","DOIUrl":"https://doi.org/10.1145/3577190.3616115","url":null,"abstract":"Human communication relies on multiple modalities such as verbal expressions, facial cues, and bodily gestures. Developing computational approaches to process and generate these multimodal signals is critical for seamless human-agent interaction. A particular challenge is the generation of co-speech gestures due to the large variability and number of gestures that can accompany a verbal utterance, leading to a one-to-many mapping problem. This paper presents an approach based on a Feature Extraction Infusion Network (FEIN-Z) that adopts insights from robot imitation learning and applies them to co-speech gesture generation. Building on the BC-Z architecture, our framework combines transformer architectures and Wasserstein generative adversarial networks. We describe the FEIN-Z methodology and evaluation results obtained within the GENEA Challenge 2023, demonstrating good results and significant improvements in human-likeness over the GENEA baseline. We discuss potential areas for improvement, such as refining input segmentation, employing more fine-grained control networks, and exploring alternative inference methods.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingbo Ma, Mehmet Celepkolu, Kristy Elizabeth Boyer, Collin F. Lynch, Eric Wiebe, Maya Israel
Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It is an open question how to build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model's accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model's accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model's accuracy in the video modality remained relatively constant at a comparable level (> 0.70) so long as at least one learner's face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings.
{"title":"How Noisy is Too Noisy? The Impact of Data Noise on Multimodal Recognition of Confusion and Conflict During Collaborative Learning","authors":"Yingbo Ma, Mehmet Celepkolu, Kristy Elizabeth Boyer, Collin F. Lynch, Eric Wiebe, Maya Israel","doi":"10.1145/3577190.3614127","DOIUrl":"https://doi.org/10.1145/3577190.3614127","url":null,"abstract":"Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It is an open question of how we can build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model’s accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model’s accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model’s accuracy remained relatively constant in the video modality at a comparable level (> 0.70) so long as at least one learner’s face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"273 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Respiration is closely related to speech, so respiratory information is useful for improving human-machine multimodal spoken interaction from various perspectives. A machine-learning task is presented for multimodal interactive systems to improve the compatibility of the systems and promote smooth interaction with them. This “video-based respiration waveform estimation (VRWE)” task consists of two subtasks: waveform amplitude estimation and waveform gradient estimation. A dataset consisting of respiratory data for 30 participants was created for this task, and a strong baseline method based on 3DCNN-ConvLSTM was evaluated on the dataset. Finally, VRWE, especially gradient estimation, was shown to be effective in predicting user voice activity after 200 ms. These results suggest that VRWE is effective for improving human-machine multimodal interaction.
{"title":"Video-based Respiratory Waveform Estimation in Dialogue: A Novel Task and Dataset for Human-Machine Interaction","authors":"Takao Obi, Kotaro Funakoshi","doi":"10.1145/3577190.3614154","DOIUrl":"https://doi.org/10.1145/3577190.3614154","url":null,"abstract":"Respiration is closely related to speech, so respiratory information is useful for improving human-machine multimodal spoken interaction from various perspectives. A machine-learning task is presented for multimodal interactive systems to improve the compatibility of the systems and promote smooth interaction with them. This “video-based respiration waveform estimation (VRWE)” task consists of two subtasks: waveform amplitude estimation and waveform gradient estimation. A dataset consisting of respiratory data for 30 participants was created for this task, and a strong baseline method based on 3DCNN-ConvLSTM was evaluated on the dataset. Finally, VRWE, especially gradient estimation, was shown to be effective in predicting user voice activity after 200 ms. These results suggest that VRWE is effective for improving human-machine multimodal interaction.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incorporating feature uncertainty during model construction reveals the true generalization ability of a model, yet this factor has often been neglected in automatic gait event detection for Cerebral Palsy patients. Moreover, prevailing vision-based gait event detection systems are expensive because they rely on high-end motion tracking cameras. This study proposes a low-cost gait event detection system for heel strike and toe-off events. A state-space model was constructed in which the temporal evolution of the gait signal was devised by quantifying feature uncertainty. The model was trained using the Cardiff classifier, with ankle velocity as the input feature. The frame associated with a state transition was marked as a gait event. The model was tested on 15 Cerebral Palsy patients and 15 normal subjects, with data acquired using low-cost Kinect cameras. The model identified gait events with an average error of 2 frames, and all events were predicted before their actual occurrence. The error for toe-off was lower than for heel strike. Incorporating the uncertainty factor in gait event detection yielded performance competitive with the state of the art.
{"title":"Gait Event Prediction of People with Cerebral Palsy using Feature Uncertainty: A Low-Cost Approach","authors":"Saikat Chakraborty, Noble Thomas, Anup Nandy","doi":"10.1145/3577190.3614125","DOIUrl":"https://doi.org/10.1145/3577190.3614125","url":null,"abstract":"Incorporation of feature uncertainty during model construction explores the real generalization ability of that model. But this factor has been avoided often during automatic gait event detection for Cerebral Palsy patients. Again, the prevailing vision-based gait event detection systems are expensive due to incorporation of high-end motion tracking cameras. This study proposes a low-cost gait event detection system for heel strike and toe-off events. A state-space model was constructed where the temporal evolution of gait signal was devised by quantifying feature uncertainty. The model was trained using Cardiff classifier. Ankle velocity was taken as the input feature. The frame associated with state transition was marked as a gait event. The model was tested on 15 Cerebral Palsy patients and 15 normal subjects. Data acquisition was performed using low-cost Kinect cameras. The model identified gait events on an average of 2 frame error. All events were predicted before the actual occurrence. Error for toe-off was less than the heel strike. Incorporation of the uncertainty factor in the detection of gait events exhibited a competing performance with respect to state-of-the-art.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"36 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gauthier Robert Jean Faisandaz, Alix Goguey, Christophe Jouffrais, Laurence Nigay
We present µGeT, a novel multimodal eyes-free text selection technique. µGeT combines touch interaction with microgestures and is especially suited for People with Visual Impairments (PVI): it expands the input bandwidth of touchscreen devices, thus shortening the interaction paths for routine tasks. To do so, µGeT extends touch interaction (left/right and up/down flicks) with two simple microgestures: the thumb touching either the index or the middle finger. For text selection, the multimodal technique allows users to directly modify the positioning of the two selection handles and the granularity of text selection. Two user studies, one with 9 PVI and one with 8 blindfolded sighted people, compared µGeT with a common baseline technique (VoiceOver-like interaction on the iPhone). Despite large variability in performance, the two user studies showed that µGeT is globally faster and yields fewer errors than VoiceOver. A detailed analysis of the interaction trajectories highlights the different strategies adopted by the participants. Beyond text selection, this research shows the potential of combining touch interaction and microgestures for improving the accessibility of touchscreen devices for PVI.
{"title":"µGeT: Multimodal eyes-free text selection technique combining touch interaction and microgestures","authors":"Gauthier Robert Jean Faisandaz, Alix Goguey, Christophe Jouffrais, Laurence Nigay","doi":"10.1145/3577190.3614131","DOIUrl":"https://doi.org/10.1145/3577190.3614131","url":null,"abstract":"We present µGeT, a novel multimodal eyes-free text selection technique. µGeT combines touch interaction with microgestures. µGeT is especially suited for People with Visual Impairments (PVI) by expanding the input bandwidth of touchscreen devices, thus shortening the interaction paths for routine tasks. To do so, µGeT extends touch interaction (left/right and up/down flicks) using two simple microgestures: thumb touching either the index or the middle finger. For text selection, the multimodal technique allows us to directly modify the positioning of the two selection handles and the granularity of text selection. Two user studies, one with 9 PVI and one with 8 blindfolded sighted people, compared µGeT with a baseline common technique (VoiceOver like on iPhone). Despite a large variability in performance, the two user studies showed that µGeT is globally faster and yields fewer errors than VoiceOver. A detailed analysis of the interaction trajectories highlights the different strategies adopted by the participants. Beyond text selection, this research shows the potential of combining touch interaction and microgestures for improving the accessibility of touchscreen devices for PVI.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tangible user interfaces offer the benefit of incorporating physical aspects into the interaction with digital systems, enriching how system information can be conveyed. We investigated how visual, haptic, and audio modalities influence young children's joint actions. We used a design-based research method to design and develop a multi-sensory tangible device. Two kindergarten teachers and 31 children were involved in our design process. We tested the final prototype with 20 children aged 5-6 from three kindergartens. The main findings were: a) involving and getting approval from kindergarten teachers in the design process was essential; b) simultaneously providing visual and audio feedback might help improve children's collaborative actions. Our study was interdisciplinary research on human-computer interaction and children's education, contributing an empirical understanding of the factors influencing children's collaboration and communication.
{"title":"Exploring Feedback Modality Designs to Improve Young Children's Collaborative Actions","authors":"Amy Melniczuk, Egesa Vrapi","doi":"10.1145/3577190.3614140","DOIUrl":"https://doi.org/10.1145/3577190.3614140","url":null,"abstract":"Tangible user interfaces offer the benefit of incorporating physical aspects in the interaction with digital systems, enriching how system information can be conveyed. We investigated how visual, haptic, and audio modalities influence young children’s joint actions. We used a design-based research method to design and develop a multi-sensory tangible device. Two kindergarten teachers and 31 children were involved in our design process. We tested the final prototype with 20 children aged 5-6 from three kindergartens. The main findings were: a) involving and getting approval from kindergarten teachers in the design process was essential; b) simultaneously providing visual and audio feedback might help improve children’s collaborative actions. Our study was an interdisciplinary research on human-computer interaction and children’s education, which contributed an empirical understanding of the factors influencing children collaboration and communication.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Camille Sallaberry, Gwenn Englebienne, Jan Van Erp, Vanessa Evers
As socially mediated interaction becomes increasingly important and multimodal, even expanding into virtual reality and physical telepresence with robotic avatars, new challenges emerge. For instance, video calls have become the norm, and it is increasingly common for people to experience a form of asymmetry, such as not being heard or seen by their communication partners online due to connection issues. Previous research has not yet extensively explored the effect of such asymmetry on social interaction. In this study, 61 dyads (122 adults) played a quiz-like game using a video-conferencing platform and evaluated the quality of their social interaction by measuring five sub-scales of social presence. The dyads had either symmetrical access to social cues (both only audio, or both audio and video) or asymmetrical access (one partner receiving only audio, the other audio and video). Our results showed that in the case of asymmetrical access, the party receiving more modalities, i.e. audio and video from the other, felt significantly less connected than their partner. We discuss these results in relation to Media Richness Theory (MRT) and the Hyperpersonal Model: under asymmetry, more modalities or cues do not necessarily increase the feeling of social connectedness, contrary to what MRT predicts. We hypothesize that participants sending fewer cues compensate by increasing the richness of their expressions, and that the interaction shifts towards an equivalent richness for both participants.
{"title":"Out of Sight,... How Asymmetry in Video-Conference Affects Social Interaction","authors":"Camille Sallaberry, Gwenn Englebienne, Jan Van Erp, Vanessa Evers","doi":"10.1145/3577190.3614168","DOIUrl":"https://doi.org/10.1145/3577190.3614168","url":null,"abstract":"As social-mediated interaction is becoming increasingly important and multi-modal, even expanding into virtual reality and physical telepresence with robotic avatars, new challenges emerge. For instance, video calls have become the norm and it is increasingly common that people experience a form of asymmetry, such as not being heard or seen by their communication partners online due to connection issues. Previous research has not yet extensively explored the effect on social interaction. In this study, 61 Dyads, i.e. 122 adults, played a quiz-like game using a video-conferencing platform and evaluated the quality of their social interaction by measuring five sub-scales of social presence. The Dyads had either symmetrical access to social cues (both only audio, or both audio and video) or asymmetrical access (one partner receiving only audio, the other audio and video). Our results showed that in the case of asymmetrical access, the party receiving more modalities, i.e. audio and video from the other, felt significantly less connected than their partner. We discuss these results in relation to the Media Richness Theory (MRT) and the Hyperpersonal Model: in asymmetry, more modalities or cues will not necessarily increase feeling socially connected, in opposition to what was predicted by MRT. We hypothesize that participants sending fewer cues compensate by increasing the richness of their expressions and that the interaction shifts towards an equivalent richness for both participants.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emily Doherty, Cara A Spencer, Lucca Eloy, Nitin Kumar, Rachel Dickler, Leanne Hirshfield
Teamness is a newly proposed multidimensional construct aimed at characterizing teams and their dynamic levels of interdependence over time. Specifically, teamness is deeply rooted in the team cognition literature, considering how a team's composition, processes, states, and actions affect collaboration. Because this multifaceted construct has only recently been proposed, there is a call to the research community to investigate, measure, and model its dimensions. In this study, we explored the speech content of 21 human-human-agent teams during a remote collaborative search task. Using self-report surveys of their social and affective states throughout the task, we conducted factor analysis to condense the survey measures into four components closely aligned with the dimensions outlined in the teamness framework: social dynamics and trust, affect, cognitive load, and interpersonal reliance. We then extracted features from teams' speech using Linguistic Inquiry and Word Count (LIWC) and performed Epistemic Network Analysis (ENA) across these four teamwork components as well as team performance. We developed six hypotheses about how specific LIWC features would correlate with self-reported team processes and performance, which we investigated through our ENA analyses. Through quantitative and qualitative analyses of the networks, we explore differences in speech patterns across the four components and relate these findings to the dimensions of teamness. Our results indicate that ENA models based on selected LIWC features were able to capture elements of teamness as well as team performance; this technique therefore shows promise for modeling these states during CSCW and, ultimately, for designing intelligent systems that promote greater teamness using speech-based measures.
{"title":"Using Speech Patterns to Model the Dimensions of Teamness in Human-Agent Teams","authors":"Emily Doherty, Cara A Spencer, Lucca Eloy, Nitin Kumar, Rachel Dickler, Leanne Hirshfield","doi":"10.1145/3577190.3614121","DOIUrl":"https://doi.org/10.1145/3577190.3614121","url":null,"abstract":"Teamness is a newly proposed multidimensional construct aimed to characterize teams and their dynamic levels of interdependence over time. Specifically, teamness is deeply rooted in team cognition literature, considering how a team’s composition, processes, states, and actions affect collaboration. With this multifaceted construct being recently proposed, there is a call to the research community to investigate, measure, and model dimensions of teamness. In this study, we explored the speech content of 21 human-human-agent teams during a remote collaborative search task. Using self-report surveys of their social and affective states throughout the task, we conducted factor analysis to condense the survey measures into four components closely aligned with the dimensions outlined in the teamness framework: social dynamics and trust, affect, cognitive load, and interpersonal reliance. We then extracted features from teams’ speech using Linguistic Inquiry and Word Count (LIWC) and performed Epistemic Network Analyses (ENA) across these four teamwork components as well as team performance. We developed six hypotheses of how we expected specific LIWC features to correlate with self-reported team processes and performance, which we investigated through our ENA analyses. Through quantitative and qualitative analyses of the networks, we explore differences of speech patterns across the four components and relate these findings to the dimensions of teamness. Our results indicate that ENA models based on selected LIWC features were able to capture elements of teamness as well as team performance; this technique therefore shows promise for modeling of these states during CSCW, to ultimately design intelligent systems to promote greater teamness using speech-based measures.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Public discussions and imaginaries about AI often center around the idea that technologies such as neural networks might one day lead to the emergence of machines that think or even feel like humans. Drawing on histories of how people project lives onto talking things, from spiritualist seances in the Victorian era to contemporary advances in robotics, this talk argues that the “lives” of AI have more to do with how humans perceive and relate to machines exhibiting communicative behavior, than with the functioning of computing technologies in itself. Taking up this point of view helps acknowledge and further interrogate how perceptions and cultural representations inform the outcome of technologies that are programmed to interact and communicate with human users.
{"title":"Projecting life onto machines","authors":"Simone Natale","doi":"10.1145/3577190.3616522","DOIUrl":"https://doi.org/10.1145/3577190.3616522","url":null,"abstract":"Public discussions and imaginaries about AI often center around the idea that technologies such as neural networks might one day lead to the emergence of machines that think or even feel like humans. Drawing on histories of how people project lives onto talking things, from spiritualist seances in the Victorian era to contemporary advances in robotics, this talk argues that the “lives” of AI have more to do with how humans perceive and relate to machines exhibiting communicative behavior, than with the functioning of computing technologies in itself. Taking up this point of view helps acknowledge and further interrogate how perceptions and cultural representations inform the outcome of technologies that are programmed to interact and communicate with human users.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135045189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daksitha Senel Withanage Don, Philipp Müller, Fabrizio Nunnari, Elisabeth André, Patrick Gebhard
Flexible and natural nonverbal reactions to human behavior remain a challenge for socially interactive agents (SIAs), which are predominantly animated using hand-crafted rules. While recently proposed machine-learning-based approaches to conversational behavior generation are a promising way to address this challenge, they have not yet been employed in SIAs. The primary reason is the lack of a software toolkit that integrates such approaches with SIA frameworks while conforming to the challenging real-time requirements of human-agent interaction scenarios. In this work, we present, for the first time, such a toolkit, consisting of three main components: (1) real-time feature extraction capturing multimodal social cues from the user; (2) behavior generation based on a recent state-of-the-art neural network approach; and (3) visualization of the generated behavior supporting both FLAME-based and Apple ARKit-based interactive agents. We comprehensively evaluate the real-time performance of the whole framework and its components. In addition, we introduce pre-trained behavior generation models derived from psychotherapy sessions for domain-specific listening behaviors. Our software toolkit, pivotal for deploying and assessing SIAs' listening behavior in real time, is publicly available. Resources, including code and behavioral multimodal features extracted from therapeutic interactions, are hosted at https://daksitha.github.io/ReNeLib
{"title":"ReNeLiB: Real-time Neural Listening Behavior Generation for Socially Interactive Agents","authors":"Daksitha Senel Withanage Don, Philipp Müller, Fabrizio Nunnari, Elisabeth André, Patrick Gebhard","doi":"10.1145/3577190.3614133","DOIUrl":"https://doi.org/10.1145/3577190.3614133","url":null,"abstract":"Flexible and natural nonverbal reactions to human behavior remain a challenge for socially interactive agents (SIAs) that are predominantly animated using hand-crafted rules. While recently proposed machine learning based approaches to conversational behavior generation are a promising way to address this challenge, they have not yet been employed in SIAs. The primary reason for this is the lack of a software toolkit integrating such approaches with SIA frameworks that conforms to the challenging real-time requirements of human-agent interaction scenarios. In our work, we for the first time present such a toolkit consisting of three main components: (1) real-time feature extraction capturing multi-modal social cues from the user; (2) behavior generation based on a recent state-of-the-art neural network approach; (3) visualization of the generated behavior supporting both FLAME-based and Apple ARKit-based interactive agents. We comprehensively evaluate the real-time performance of the whole framework and its components. In addition, we introduce pre-trained behavioral generation models derived from psychotherapy sessions for domain-specific listening behaviors. Our software toolkit, pivotal for deploying and assessing SIAs’ listening behavior in real-time, is publicly available. Resources, including code, behavioural multi-modal features extracted from therapeutic interactions, are hosted at https://daksitha.github.io/ReNeLib","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}