
Companion Publication of the 2020 International Conference on Multimodal Interaction: Latest Publications

FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation
Leon Harz, Hendric Voß, Stefan Kopp
Human communication relies on multiple modalities such as verbal expressions, facial cues, and bodily gestures. Developing computational approaches to process and generate these multimodal signals is critical for seamless human-agent interaction. A particular challenge is the generation of co-speech gestures due to the large variability and number of gestures that can accompany a verbal utterance, leading to a one-to-many mapping problem. This paper presents an approach based on a Feature Extraction Infusion Network (FEIN-Z) that adopts insights from robot imitation learning and applies them to co-speech gesture generation. Building on the BC-Z architecture, our framework combines transformer architectures and Wasserstein generative adversarial networks. We describe the FEIN-Z methodology and evaluation results obtained within the GENEA Challenge 2023, demonstrating good results and significant improvements in human-likeness over the GENEA baseline. We discuss potential areas for improvement, such as refining input segmentation, employing more fine-grained control networks, and exploring alternative inference methods.
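To make the combination of behavior cloning and adversarial training concrete, here is a minimal, hypothetical PyTorch sketch: an autoregressive gesture generator trained with a reconstruction (behavior-cloning) loss plus a Wasserstein critic term. Module sizes, names, and the loss weight `lam` are illustrative assumptions, the Lipschitz constraint (gradient penalty or weight clipping) is omitted, and this is not the authors' FEIN-Z implementation.

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    """Predicts the next pose frame from speech features and past poses (illustrative)."""
    def __init__(self, speech_dim=64, pose_dim=57, hidden=128):
        super().__init__()
        self.in_proj = nn.Linear(speech_dim + pose_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(hidden, pose_dim)

    def forward(self, speech, past_poses):           # (B, T, speech_dim), (B, T, pose_dim)
        h = self.encoder(self.in_proj(torch.cat([speech, past_poses], dim=-1)))
        return self.out_proj(h[:, -1])                # next pose frame, (B, pose_dim)

class Critic(nn.Module):
    """Wasserstein critic scoring pose frames (Lipschitz constraint omitted for brevity)."""
    def __init__(self, pose_dim=57, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, pose):
        return self.net(pose).mean()

def losses(gen, critic, speech, past_poses, target_pose, lam=0.1):
    """Behavior-cloning loss plus adversarial term for the generator, and the critic loss."""
    pred = gen(speech, past_poses)
    gen_loss = nn.functional.mse_loss(pred, target_pose) - lam * critic(pred)
    critic_loss = critic(pred.detach()) - critic(target_pose)   # minimize fake - real score
    return gen_loss, critic_loss
```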
{"title":"FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation","authors":"Leon Harz, Hendric Voß, Stefan Kopp","doi":"10.1145/3577190.3616115","DOIUrl":"https://doi.org/10.1145/3577190.3616115","url":null,"abstract":"Human communication relies on multiple modalities such as verbal expressions, facial cues, and bodily gestures. Developing computational approaches to process and generate these multimodal signals is critical for seamless human-agent interaction. A particular challenge is the generation of co-speech gestures due to the large variability and number of gestures that can accompany a verbal utterance, leading to a one-to-many mapping problem. This paper presents an approach based on a Feature Extraction Infusion Network (FEIN-Z) that adopts insights from robot imitation learning and applies them to co-speech gesture generation. Building on the BC-Z architecture, our framework combines transformer architectures and Wasserstein generative adversarial networks. We describe the FEIN-Z methodology and evaluation results obtained within the GENEA Challenge 2023, demonstrating good results and significant improvements in human-likeness over the GENEA baseline. We discuss potential areas for improvement, such as refining input segmentation, employing more fine-grained control networks, and exploring alternative inference methods.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
How Noisy is Too Noisy? The Impact of Data Noise on Multimodal Recognition of Confusion and Conflict During Collaborative Learning
Yingbo Ma, Mehmet Celepkolu, Kristy Elizabeth Boyer, Collin F. Lynch, Eric Wiebe, Maya Israel
Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It is an open question of how we can build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model’s accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model’s accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model’s accuracy remained relatively constant in the video modality at a comparable level (> 0.70) so long as at least one learner’s face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings.
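The two noise measures named here are standard and easy to reproduce. Below is a small, self-contained sketch using textbook definitions (not the authors' code): word error rate from transcripts, and a speech-to-noise ratio in dB from separated speech and noise segments.

```python
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(ref), len(hyp)] / max(len(ref), 1)

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Speech-to-noise ratio in dB from separate speech and noise segments."""
    return 10 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

# The abstract reports accuracy drops when WER exceeds 20% or SNR falls below 5 dB.
print(word_error_rate("we are stuck on this bug", "we are stack on bug"))  # 2 errors / 6 words ~ 0.33
```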
{"title":"How Noisy is Too Noisy? The Impact of Data Noise on Multimodal Recognition of Confusion and Conflict During Collaborative Learning","authors":"Yingbo Ma, Mehmet Celepkolu, Kristy Elizabeth Boyer, Collin F. Lynch, Eric Wiebe, Maya Israel","doi":"10.1145/3577190.3614127","DOIUrl":"https://doi.org/10.1145/3577190.3614127","url":null,"abstract":"Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It is an open question of how we can build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model’s accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model’s accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model’s accuracy remained relatively constant in the video modality at a comparable level (> 0.70) so long as at least one learner’s face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"273 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Video-based Respiratory Waveform Estimation in Dialogue: A Novel Task and Dataset for Human-Machine Interaction
Takao Obi, Kotaro Funakoshi
Respiration is closely related to speech, so respiratory information is useful for improving human-machine multimodal spoken interaction from various perspectives. A machine-learning task is presented for multimodal interactive systems to improve the compatibility of the systems and promote smooth interaction with them. This “video-based respiration waveform estimation (VRWE)” task consists of two subtasks: waveform amplitude estimation and waveform gradient estimation. A dataset consisting of respiratory data for 30 participants was created for this task, and a strong baseline method based on 3DCNN-ConvLSTM was evaluated on the dataset. Finally, VRWE, especially gradient estimation, was shown to be effective in predicting user voice activity after 200 ms. These results suggest that VRWE is effective for improving human-machine multimodal interaction.
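As a rough stand-in for the 3DCNN-ConvLSTM baseline described above, the sketch below regresses a per-frame respiratory amplitude and gradient from a short video clip. The ConvLSTM is replaced by a plain LSTM over pooled 3D-CNN features for brevity, and all layer sizes are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class VideoRespirationRegressor(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.cnn3d = nn.Sequential(                       # (B, 3, T, H, W) -> spatio-temporal features
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)))           # pool space, keep the time axis
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                  # waveform amplitude and gradient per frame

    def forward(self, clip):
        f = self.cnn3d(clip)                              # (B, 32, T, 1, 1)
        f = f.squeeze(-1).squeeze(-1).transpose(1, 2)     # (B, T, 32)
        h, _ = self.lstm(f)
        return self.head(h)                               # (B, T, 2)

# Usage: two 16-frame 64x64 RGB clips.
waveform = VideoRespirationRegressor()(torch.randn(2, 3, 16, 64, 64))
print(waveform.shape)  # torch.Size([2, 16, 2])
```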
{"title":"Video-based Respiratory Waveform Estimation in Dialogue: A Novel Task and Dataset for Human-Machine Interaction","authors":"Takao Obi, Kotaro Funakoshi","doi":"10.1145/3577190.3614154","DOIUrl":"https://doi.org/10.1145/3577190.3614154","url":null,"abstract":"Respiration is closely related to speech, so respiratory information is useful for improving human-machine multimodal spoken interaction from various perspectives. A machine-learning task is presented for multimodal interactive systems to improve the compatibility of the systems and promote smooth interaction with them. This “video-based respiration waveform estimation (VRWE)” task consists of two subtasks: waveform amplitude estimation and waveform gradient estimation. A dataset consisting of respiratory data for 30 participants was created for this task, and a strong baseline method based on 3DCNN-ConvLSTM was evaluated on the dataset. Finally, VRWE, especially gradient estimation, was shown to be effective in predicting user voice activity after 200 ms. These results suggest that VRWE is effective for improving human-machine multimodal interaction.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Gait Event Prediction of People with Cerebral Palsy using Feature Uncertainty: A Low-Cost Approach
Saikat Chakraborty, Noble Thomas, Anup Nandy
Incorporating feature uncertainty during model construction reveals the true generalization ability of a model, yet this factor has often been neglected in automatic gait event detection for Cerebral Palsy patients. Moreover, prevailing vision-based gait event detection systems are expensive because they rely on high-end motion tracking cameras. This study proposes a low-cost gait event detection system for heel-strike and toe-off events. A state-space model was constructed in which the temporal evolution of the gait signal was modeled by quantifying feature uncertainty. The model was trained using the Cardiff classifier, with ankle velocity as the input feature, and the frame associated with a state transition was marked as a gait event. The model was tested on 15 Cerebral Palsy patients and 15 normal subjects, with data acquired using low-cost Kinect cameras. The model identified gait events with an average error of 2 frames, and all events were predicted before their actual occurrence; the error for toe-off was smaller than for heel strike. Incorporating the uncertainty factor in gait event detection yielded performance competitive with the state of the art.
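For illustration only, the sketch below marks gait-event candidate frames as the frames where a discretized swing/stance state changes, with ankle velocity as the single input feature. The threshold rule and smoothing window are assumptions and do not reproduce the paper's Cardiff-classifier state-space model or its uncertainty quantification.

```python
import numpy as np

def gait_event_frames(ankle_velocity: np.ndarray, thresh: float = 0.15, win: int = 5):
    """Return (heel_strike_frames, toe_off_frames) from a 1-D ankle-velocity trace."""
    v = np.convolve(ankle_velocity, np.ones(win) / win, mode="same")  # light smoothing
    swing = v > thresh                       # True while the foot moves fast (swing phase)
    change = np.flatnonzero(np.diff(swing.astype(int)))               # state-transition frames
    heel_strikes = change[swing[change]]     # swing -> stance transitions
    toe_offs = change[~swing[change]]        # stance -> swing transitions
    return heel_strikes, toe_offs
```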
{"title":"Gait Event Prediction of People with Cerebral Palsy using Feature Uncertainty: A Low-Cost Approach","authors":"Saikat Chakraborty, Noble Thomas, Anup Nandy","doi":"10.1145/3577190.3614125","DOIUrl":"https://doi.org/10.1145/3577190.3614125","url":null,"abstract":"Incorporation of feature uncertainty during model construction explores the real generalization ability of that model. But this factor has been avoided often during automatic gait event detection for Cerebral Palsy patients. Again, the prevailing vision-based gait event detection systems are expensive due to incorporation of high-end motion tracking cameras. This study proposes a low-cost gait event detection system for heel strike and toe-off events. A state-space model was constructed where the temporal evolution of gait signal was devised by quantifying feature uncertainty. The model was trained using Cardiff classifier. Ankle velocity was taken as the input feature. The frame associated with state transition was marked as a gait event. The model was tested on 15 Cerebral Palsy patients and 15 normal subjects. Data acquisition was performed using low-cost Kinect cameras. The model identified gait events on an average of 2 frame error. All events were predicted before the actual occurrence. Error for toe-off was less than the heel strike. Incorporation of the uncertainty factor in the detection of gait events exhibited a competing performance with respect to state-of-the-art.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"36 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
µGeT: Multimodal eyes-free text selection technique combining touch interaction and microgestures
Gauthier Robert Jean Faisandaz, Alix Goguey, Christophe Jouffrais, Laurence Nigay
We present µGeT, a novel multimodal eyes-free text selection technique. µGeT combines touch interaction with microgestures. µGeT is especially suited for People with Visual Impairments (PVI) by expanding the input bandwidth of touchscreen devices, thus shortening the interaction paths for routine tasks. To do so, µGeT extends touch interaction (left/right and up/down flicks) using two simple microgestures: thumb touching either the index or the middle finger. For text selection, the multimodal technique allows us to directly modify the positioning of the two selection handles and the granularity of text selection. Two user studies, one with 9 PVI and one with 8 blindfolded sighted people, compared µGeT with a baseline common technique (VoiceOver like on iPhone). Despite a large variability in performance, the two user studies showed that µGeT is globally faster and yields fewer errors than VoiceOver. A detailed analysis of the interaction trajectories highlights the different strategies adopted by the participants. Beyond text selection, this research shows the potential of combining touch interaction and microgestures for improving the accessibility of touchscreen devices for PVI.
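A hypothetical sketch of how combined flick and microgesture events could drive eyes-free selection: vertical flicks change granularity, horizontal flicks move a handle, and the microgesture picks which handle. The concrete mapping and step sizes below are assumptions for illustration, not the mapping evaluated in the paper.

```python
from dataclasses import dataclass

GRANULARITIES = ["character", "word", "sentence"]

@dataclass
class SelectionState:
    start: int = 0            # index of the start handle
    end: int = 0              # index of the end handle
    granularity: str = "character"

def handle_event(state: SelectionState, flick: str, microgesture: str) -> SelectionState:
    """flick in {'left','right','up','down'}; microgesture in {'thumb-index','thumb-middle'}."""
    step = {"character": 1, "word": 5, "sentence": 40}[state.granularity]  # toy step sizes
    if flick in ("up", "down"):
        # Vertical flicks cycle the selection granularity.
        i = GRANULARITIES.index(state.granularity) + (1 if flick == "up" else -1)
        state.granularity = GRANULARITIES[i % len(GRANULARITIES)]
    else:
        delta = step if flick == "right" else -step
        # The microgesture chooses which selection handle the flick moves.
        if microgesture == "thumb-index":
            state.start = max(0, state.start + delta)
        else:
            state.end = max(state.start, state.end + delta)
    return state
```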
{"title":"µGeT: Multimodal eyes-free text selection technique combining touch interaction and microgestures","authors":"Gauthier Robert Jean Faisandaz, Alix Goguey, Christophe Jouffrais, Laurence Nigay","doi":"10.1145/3577190.3614131","DOIUrl":"https://doi.org/10.1145/3577190.3614131","url":null,"abstract":"We present µGeT, a novel multimodal eyes-free text selection technique. µGeT combines touch interaction with microgestures. µGeT is especially suited for People with Visual Impairments (PVI) by expanding the input bandwidth of touchscreen devices, thus shortening the interaction paths for routine tasks. To do so, µGeT extends touch interaction (left/right and up/down flicks) using two simple microgestures: thumb touching either the index or the middle finger. For text selection, the multimodal technique allows us to directly modify the positioning of the two selection handles and the granularity of text selection. Two user studies, one with 9 PVI and one with 8 blindfolded sighted people, compared µGeT with a baseline common technique (VoiceOver like on iPhone). Despite a large variability in performance, the two user studies showed that µGeT is globally faster and yields fewer errors than VoiceOver. A detailed analysis of the interaction trajectories highlights the different strategies adopted by the participants. Beyond text selection, this research shows the potential of combining touch interaction and microgestures for improving the accessibility of touchscreen devices for PVI.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploring Feedback Modality Designs to Improve Young Children's Collaborative Actions
Amy Melniczuk, Egesa Vrapi
Tangible user interfaces offer the benefit of incorporating physical aspects in the interaction with digital systems, enriching how system information can be conveyed. We investigated how visual, haptic, and audio modalities influence young children’s joint actions. We used a design-based research method to design and develop a multi-sensory tangible device. Two kindergarten teachers and 31 children were involved in our design process. We tested the final prototype with 20 children aged 5-6 from three kindergartens. The main findings were: a) involving and getting approval from kindergarten teachers in the design process was essential; b) simultaneously providing visual and audio feedback might help improve children’s collaborative actions. Our study was an interdisciplinary research on human-computer interaction and children’s education, which contributed an empirical understanding of the factors influencing children collaboration and communication.
{"title":"Exploring Feedback Modality Designs to Improve Young Children's Collaborative Actions","authors":"Amy Melniczuk, Egesa Vrapi","doi":"10.1145/3577190.3614140","DOIUrl":"https://doi.org/10.1145/3577190.3614140","url":null,"abstract":"Tangible user interfaces offer the benefit of incorporating physical aspects in the interaction with digital systems, enriching how system information can be conveyed. We investigated how visual, haptic, and audio modalities influence young children’s joint actions. We used a design-based research method to design and develop a multi-sensory tangible device. Two kindergarten teachers and 31 children were involved in our design process. We tested the final prototype with 20 children aged 5-6 from three kindergartens. The main findings were: a) involving and getting approval from kindergarten teachers in the design process was essential; b) simultaneously providing visual and audio feedback might help improve children’s collaborative actions. Our study was an interdisciplinary research on human-computer interaction and children’s education, which contributed an empirical understanding of the factors influencing children collaboration and communication.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Out of Sight,... How Asymmetry in Video-Conference Affects Social Interaction
Camille Sallaberry, Gwenn Englebienne, Jan Van Erp, Vanessa Evers
As social-mediated interaction is becoming increasingly important and multi-modal, even expanding into virtual reality and physical telepresence with robotic avatars, new challenges emerge. For instance, video calls have become the norm and it is increasingly common that people experience a form of asymmetry, such as not being heard or seen by their communication partners online due to connection issues. Previous research has not yet extensively explored the effect on social interaction. In this study, 61 Dyads, i.e. 122 adults, played a quiz-like game using a video-conferencing platform and evaluated the quality of their social interaction by measuring five sub-scales of social presence. The Dyads had either symmetrical access to social cues (both only audio, or both audio and video) or asymmetrical access (one partner receiving only audio, the other audio and video). Our results showed that in the case of asymmetrical access, the party receiving more modalities, i.e. audio and video from the other, felt significantly less connected than their partner. We discuss these results in relation to the Media Richness Theory (MRT) and the Hyperpersonal Model: in asymmetry, more modalities or cues will not necessarily increase feeling socially connected, in opposition to what was predicted by MRT. We hypothesize that participants sending fewer cues compensate by increasing the richness of their expressions and that the interaction shifts towards an equivalent richness for both participants.
{"title":"Out of Sight,... How Asymmetry in Video-Conference Affects Social Interaction","authors":"Camille Sallaberry, Gwenn Englebienne, Jan Van Erp, Vanessa Evers","doi":"10.1145/3577190.3614168","DOIUrl":"https://doi.org/10.1145/3577190.3614168","url":null,"abstract":"As social-mediated interaction is becoming increasingly important and multi-modal, even expanding into virtual reality and physical telepresence with robotic avatars, new challenges emerge. For instance, video calls have become the norm and it is increasingly common that people experience a form of asymmetry, such as not being heard or seen by their communication partners online due to connection issues. Previous research has not yet extensively explored the effect on social interaction. In this study, 61 Dyads, i.e. 122 adults, played a quiz-like game using a video-conferencing platform and evaluated the quality of their social interaction by measuring five sub-scales of social presence. The Dyads had either symmetrical access to social cues (both only audio, or both audio and video) or asymmetrical access (one partner receiving only audio, the other audio and video). Our results showed that in the case of asymmetrical access, the party receiving more modalities, i.e. audio and video from the other, felt significantly less connected than their partner. We discuss these results in relation to the Media Richness Theory (MRT) and the Hyperpersonal Model: in asymmetry, more modalities or cues will not necessarily increase feeling socially connected, in opposition to what was predicted by MRT. We hypothesize that participants sending fewer cues compensate by increasing the richness of their expressions and that the interaction shifts towards an equivalent richness for both participants.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using Speech Patterns to Model the Dimensions of Teamness in Human-Agent Teams
Emily Doherty, Cara A Spencer, Lucca Eloy, Nitin Kumar, Rachel Dickler, Leanne Hirshfield
Teamness is a newly proposed multidimensional construct aimed to characterize teams and their dynamic levels of interdependence over time. Specifically, teamness is deeply rooted in team cognition literature, considering how a team’s composition, processes, states, and actions affect collaboration. With this multifaceted construct being recently proposed, there is a call to the research community to investigate, measure, and model dimensions of teamness. In this study, we explored the speech content of 21 human-human-agent teams during a remote collaborative search task. Using self-report surveys of their social and affective states throughout the task, we conducted factor analysis to condense the survey measures into four components closely aligned with the dimensions outlined in the teamness framework: social dynamics and trust, affect, cognitive load, and interpersonal reliance. We then extracted features from teams’ speech using Linguistic Inquiry and Word Count (LIWC) and performed Epistemic Network Analyses (ENA) across these four teamwork components as well as team performance. We developed six hypotheses of how we expected specific LIWC features to correlate with self-reported team processes and performance, which we investigated through our ENA analyses. Through quantitative and qualitative analyses of the networks, we explore differences of speech patterns across the four components and relate these findings to the dimensions of teamness. Our results indicate that ENA models based on selected LIWC features were able to capture elements of teamness as well as team performance; this technique therefore shows promise for modeling of these states during CSCW, to ultimately design intelligent systems to promote greater teamness using speech-based measures.
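A minimal sketch of the analysis pipeline described above, under stated assumptions: per-team survey items are condensed with factor analysis (scikit-learn here), and word-category proportions computed from a toy dictionary stand in for LIWC features. The commercial LIWC lexicon and Epistemic Network Analysis themselves are not reproduced.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def survey_components(survey_items: np.ndarray, n_components: int = 4) -> np.ndarray:
    """Condense per-team survey item scores (teams x items) into latent components."""
    return FactorAnalysis(n_components=n_components, random_state=0).fit_transform(survey_items)

TOY_CATEGORIES = {                 # stand-in for LIWC word categories
    "social": {"we", "us", "our", "team"},
    "affect": {"good", "great", "hard", "frustrating"},
}

def speech_features(utterances: list[str]) -> dict[str, float]:
    """Proportion of words falling into each category across a team's utterances."""
    words = [w.lower().strip(".,!?") for u in utterances for w in u.split()]
    total = max(len(words), 1)
    return {c: sum(w in vocab for w in words) / total for c, vocab in TOY_CATEGORIES.items()}

# Example: latent components for 21 teams with 12 survey items (random stand-in data).
scores = survey_components(np.random.rand(21, 12))
print(scores.shape)                                     # (21, 4)
print(speech_features(["We did great", "Our turn now"]))
```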
{"title":"Using Speech Patterns to Model the Dimensions of Teamness in Human-Agent Teams","authors":"Emily Doherty, Cara A Spencer, Lucca Eloy, Nitin Kumar, Rachel Dickler, Leanne Hirshfield","doi":"10.1145/3577190.3614121","DOIUrl":"https://doi.org/10.1145/3577190.3614121","url":null,"abstract":"Teamness is a newly proposed multidimensional construct aimed to characterize teams and their dynamic levels of interdependence over time. Specifically, teamness is deeply rooted in team cognition literature, considering how a team’s composition, processes, states, and actions affect collaboration. With this multifaceted construct being recently proposed, there is a call to the research community to investigate, measure, and model dimensions of teamness. In this study, we explored the speech content of 21 human-human-agent teams during a remote collaborative search task. Using self-report surveys of their social and affective states throughout the task, we conducted factor analysis to condense the survey measures into four components closely aligned with the dimensions outlined in the teamness framework: social dynamics and trust, affect, cognitive load, and interpersonal reliance. We then extracted features from teams’ speech using Linguistic Inquiry and Word Count (LIWC) and performed Epistemic Network Analyses (ENA) across these four teamwork components as well as team performance. We developed six hypotheses of how we expected specific LIWC features to correlate with self-reported team processes and performance, which we investigated through our ENA analyses. Through quantitative and qualitative analyses of the networks, we explore differences of speech patterns across the four components and relate these findings to the dimensions of teamness. Our results indicate that ENA models based on selected LIWC features were able to capture elements of teamness as well as team performance; this technique therefore shows promise for modeling of these states during CSCW, to ultimately design intelligent systems to promote greater teamness using speech-based measures.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Projecting life onto machines
Simone Natale
Public discussions and imaginaries about AI often center around the idea that technologies such as neural networks might one day lead to the emergence of machines that think or even feel like humans. Drawing on histories of how people project lives onto talking things, from spiritualist seances in the Victorian era to contemporary advances in robotics, this talk argues that the “lives” of AI have more to do with how humans perceive and relate to machines exhibiting communicative behavior, than with the functioning of computing technologies in itself. Taking up this point of view helps acknowledge and further interrogate how perceptions and cultural representations inform the outcome of technologies that are programmed to interact and communicate with human users.
{"title":"Projecting life onto machines","authors":"Simone Natale","doi":"10.1145/3577190.3616522","DOIUrl":"https://doi.org/10.1145/3577190.3616522","url":null,"abstract":"Public discussions and imaginaries about AI often center around the idea that technologies such as neural networks might one day lead to the emergence of machines that think or even feel like humans. Drawing on histories of how people project lives onto talking things, from spiritualist seances in the Victorian era to contemporary advances in robotics, this talk argues that the “lives” of AI have more to do with how humans perceive and relate to machines exhibiting communicative behavior, than with the functioning of computing technologies in itself. Taking up this point of view helps acknowledge and further interrogate how perceptions and cultural representations inform the outcome of technologies that are programmed to interact and communicate with human users.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135045189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ReNeLiB: Real-time Neural Listening Behavior Generation for Socially Interactive Agents
Daksitha Senel Withanage Don, Philipp Müller, Fabrizio Nunnari, Elisabeth André, Patrick Gebhard
Flexible and natural nonverbal reactions to human behavior remain a challenge for socially interactive agents (SIAs) that are predominantly animated using hand-crafted rules. While recently proposed machine learning based approaches to conversational behavior generation are a promising way to address this challenge, they have not yet been employed in SIAs. The primary reason for this is the lack of a software toolkit integrating such approaches with SIA frameworks that conforms to the challenging real-time requirements of human-agent interaction scenarios. In our work, we present, for the first time, such a toolkit consisting of three main components: (1) real-time feature extraction capturing multi-modal social cues from the user; (2) behavior generation based on a recent state-of-the-art neural network approach; (3) visualization of the generated behavior supporting both FLAME-based and Apple ARKit-based interactive agents. We comprehensively evaluate the real-time performance of the whole framework and its components. In addition, we introduce pre-trained behavioral generation models derived from psychotherapy sessions for domain-specific listening behaviors. Our software toolkit, pivotal for deploying and assessing SIAs’ listening behavior in real-time, is publicly available. Resources, including code and behavioural multi-modal features extracted from therapeutic interactions, are hosted at https://daksitha.github.io/ReNeLib
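The three components translate naturally into a fixed-rate loop. The schematic sketch below shows that structure with placeholder interfaces; none of these class or method names are the actual ReNeLiB API.

```python
import time
from typing import Protocol

class FeatureExtractor(Protocol):
    def extract(self, frame) -> dict: ...            # e.g. gaze, head pose, voice activity

class BehaviorModel(Protocol):
    def generate(self, features: dict) -> dict: ...  # e.g. blendshape / FLAME parameters

class Visualizer(Protocol):
    def render(self, behavior: dict) -> None: ...

def run_realtime_loop(camera, extractor: FeatureExtractor,
                      model: BehaviorModel, viz: Visualizer, fps: float = 25.0):
    """Drive the agent at a fixed frame rate; `camera` is a placeholder with a read() method."""
    budget = 1.0 / fps
    while True:
        t0 = time.perf_counter()
        frame = camera.read()
        behavior = model.generate(extractor.extract(frame))
        viz.render(behavior)
        # Sleep whatever is left of the frame budget to stay real time.
        time.sleep(max(0.0, budget - (time.perf_counter() - t0)))
```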
{"title":"ReNeLiB: Real-time Neural Listening Behavior Generation for Socially Interactive Agents","authors":"Daksitha Senel Withanage Don, Philipp Müller, Fabrizio Nunnari, Elisabeth André, Patrick Gebhard","doi":"10.1145/3577190.3614133","DOIUrl":"https://doi.org/10.1145/3577190.3614133","url":null,"abstract":"Flexible and natural nonverbal reactions to human behavior remain a challenge for socially interactive agents (SIAs) that are predominantly animated using hand-crafted rules. While recently proposed machine learning based approaches to conversational behavior generation are a promising way to address this challenge, they have not yet been employed in SIAs. The primary reason for this is the lack of a software toolkit integrating such approaches with SIA frameworks that conforms to the challenging real-time requirements of human-agent interaction scenarios. In our work, we for the first time present such a toolkit consisting of three main components: (1) real-time feature extraction capturing multi-modal social cues from the user; (2) behavior generation based on a recent state-of-the-art neural network approach; (3) visualization of the generated behavior supporting both FLAME-based and Apple ARKit-based interactive agents. We comprehensively evaluate the real-time performance of the whole framework and its components. In addition, we introduce pre-trained behavioral generation models derived from psychotherapy sessions for domain-specific listening behaviors. Our software toolkit, pivotal for deploying and assessing SIAs’ listening behavior in real-time, is publicly available. Resources, including code, behavioural multi-modal features extracted from therapeutic interactions, are hosted at https://daksitha.github.io/ReNeLib","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0