Mai Imamura, Ayane Tashiro, Shiro Kumano, Kazuhiro Otsuka
A framework, synergetic functional spectrum analysis (sFSA), is proposed to reveal how multimodal nonverbal behaviors such as head movements and facial expressions cooperatively perform communicative functions in conversations. We first introduce a functional spectrum to represent the functional multiplicity and ambiguity in nonverbal behaviors, e.g., a nod could imply listening, agreement, or both. More specifically, the functional spectrum is defined as the distribution of perceptual intensities of multiple functions across multiple modalities, based on multiple raters’ judgments. Next, the functional spectrum is decomposed into a small basis set, called the synergetic functional basis, which characterizes primary and distinctive multimodal functionalities and spans a synergetic functional space. Using these bases, the input spectrum is approximated as a linear combination of the bases and corresponding coefficients, which represent its coordinates in the functional space. To this end, this paper proposes semi-orthogonal nonnegative matrix factorization (SO-NMF) and discovers essential multimodal synergies in the listener’s back-channels, thinking, and positive responses, and in the speaker’s thinking and addressing. Furthermore, we propose regression models based on convolutional neural networks (CNNs) to estimate the functional space coordinates from head movements and facial action units, and confirm the potential of sFSA.
{"title":"Analyzing Synergetic Functional Spectrum from Head Movements and Facial Expressions in Conversations","authors":"Mai Imamura, Ayane Tashiro, Shiro Kumano, Kazuhiro Otsuka","doi":"10.1145/3577190.3614153","DOIUrl":"https://doi.org/10.1145/3577190.3614153","url":null,"abstract":"A framework, synergetic functional spectrum analysis (sFSA), is proposed to reveal how multimodal nonverbal behaviors such as head movements and facial expressions cooperatively perform communicative functions in conversations. We first introduce a functional spectrum to represent the functional multiplicity and ambiguity in nonverbal behaviors, e.g., a nod could imply listening, agreement, or both. More specifically, the functional spectrum is defined as the distribution of perceptual intensities of multiple functions across multiple modalities, which are based on multiple raters’ judgments. Next, the functional spectrum is decomposed into a small basis set called the synergetic functional basis, which can characterize primary and distinctive multimodal functionalities and span a synergetic functional space. Using these bases, the input spectrum is approximated as a linear combination of the bases and corresponding coefficients, which represent the coordinate in the functional space. To that purpose, this paper proposes semi-orthogonal nonnegative matrix factorization (SO-NMF) and discovers some essential multimodal synergies in the listener’s back-channel, thinking, positive responses, and speaker’s thinking and addressing. Furthermore, we proposes regression models based on convolutional neural networks (CNNs) to estimate the functional space coordinates from head movements and facial action units, and confirm the potential of the sFSA.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ayaka Ideno, Takuhiro Kaneko, Tatsuya Harada
Understanding an avatar’s motion and controlling its content is important for content creation and has been actively studied in computer vision and graphics. An avatar’s motion consists of frames, each representing a pose at a point in time, and a subsequence of frames can be grouped into a segment based on its semantic meaning. To enable semantic-level control of motion, it is important to understand the semantic division of the avatar’s motion. We define a semantic division of the avatar’s motion as an “event”, which switches only when a frame in the motion cannot be predicted from the previous frames and the information of the last event, and we tackle editing motion and inferring motion from text based on events. This is challenging because we must both obtain the event information and control the content of the motion based on it. To overcome this challenge, we propose obtaining a frame-level event representation from paired motion and text and using it to edit events in motion and to predict motion from text. Specifically, we learn a frame-level event representation by reconstructing the avatar’s motion from the corresponding frame-level event representation sequence while inferring that sequence from the text. By doing so, we can predict motion from text. Also, since the event at each motion frame is represented by the corresponding event representation, we can edit events in motion by editing the corresponding event representation sequence. We evaluated our method on the HumanML3D dataset and demonstrated that our model can generate motion from text while editing motion flexibly (e.g., allowing changes to event duration, modification of event characteristics, and the addition of new events).
"Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion." https://doi.org/10.1145/3577190.3614175
Zeyu Zhao, Nan Gao, Zhi Zeng, Guixuan Zhang, Jie Liu, Shuwu Zhang
This paper presents the CASIA-GO entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) Challenge 2023. The system is designed for few-shot scenarios, such as generating gestures in the style of an arbitrary in-the-wild target speaker from short speech samples. Given a set of reference speech data, including gesture sequences, audio, and text, it first constructs a gesture motion graph that describes the soft gesture units and the inter-frame continuity within the speech; when test audio and text are provided, new rhythmic and semantically appropriate gestures are reenacted by pathfinding over this graph. We randomly choose one clip from the training data for each test clip to simulate a few-shot scenario and provide compatible results for the subjective evaluations. Despite using, on average, only 0.25% of the whole training set per test clip and 17.5% of the training set across the entire test set, the system produces valid results and ranks in the top third for appropriateness to agent speech.
{"title":"Gesture Motion Graphs for Few-Shot Speech-Driven Gesture Reenactment","authors":"Zeyu Zhao, Nan Gao, Zhi Zeng, Guixuan Zhang, Jie Liu, Shuwu Zhang","doi":"10.1145/3577190.3616118","DOIUrl":"https://doi.org/10.1145/3577190.3616118","url":null,"abstract":"This paper presents the CASIA-GO entry to the Generation and Evaluation of Non-verbal Behaviour for Embedded Agents (GENEA) Challenge 2023. The system is originally designed for few-shot scenarios such as generating gestures with the style of any in-the-wild target speaker from short speech samples. Given a group of reference speech data including gesture sequences, audio, and text, it first constructs a gesture motion graph that describes the soft gesture units and interframe continuity inside the speech, which is ready to be used for new rhythmic and semantic gesture reenactment by pathfinding when test audio and text are provided. We randomly choose one clip from the training data for one test clip to simulate a few-shot scenario and provide compatible results for subjective evaluations. Despite the 0.25% average utilization of the whole training set for each clip in the test set and the 17.5% total utilization of the training set for the whole test set, the system succeeds in providing valid results and ranks in the top 1/3 in the appropriateness for agent speech evaluation.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, Kazushi Ikeda
This paper describes the 9th Emotion Recognition in the Wild (EmotiW) challenge, run as a grand challenge at the 25th ACM International Conference on Multimodal Interaction 2023. The EmotiW challenge focuses on affect-related benchmarking tasks and comprises two sub-challenges: a) User Engagement Prediction in the Wild, and b) Audio-Visual Group-based Emotion Recognition. The purpose of the challenge is to provide a common platform for researchers from diverse domains. The objective is to promote the development and assessment of methods that can predict engagement levels and/or identify the perceived emotional well-being of a group of individuals in real-world circumstances. We describe the datasets, the challenge protocols, and the accompanying sub-challenges.
{"title":"EmotiW 2023: Emotion Recognition in the Wild Challenge","authors":"Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, Kazushi Ikeda","doi":"10.1145/3577190.3616545","DOIUrl":"https://doi.org/10.1145/3577190.3616545","url":null,"abstract":"This paper describes the 9th Emotion Recognition in the Wild (EmotiW) challenge, which is being run as a grand challenge at the 25th ACM International Conference on Multimodal Interaction 2023. EmotiW challenge focuses on affect related benchmarking tasks and comprises of two sub-challenges: a) User Engagement Prediction in the Wild, and b) Audio-Visual Group-based Emotion Recognition. The purpose of this challenge is to provide a common platform for researchers from diverse domains. The objective is to promote the development and assessment of methods, which can predict engagement levels and/or identify perceived emotional well-being of a group of individuals in real-world circumstances. We describe the datasets, the challenge protocols and the accompanying sub-challenge.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yasheng Sun, Qianyi Wu, Hang Zhou, Kaisiyuan Wang, Tianshu Hu, Chen-Chieh Liao, Dongliang He, Jingtuo Liu, Errui Ding, Jingdong Wang, Shio Miyafuji, Ziwei Liu, Hideki Koike
Creating photo-realistic versions of people’s sketched portraits is useful for various entertainment purposes. Existing studies only generate portraits in the 2D plane with fixed views, making the results less vivid. In this paper, we present Stereoscopic Simplified Sketch-to-Portrait (SSSP), which explores the possibility of creating stereoscopic, 3D-aware portraits from simple contour sketches by involving 3D generative models. Our key insight is to design sketch-aware constraints that fully exploit the prior knowledge of a tri-plane-based 3D-aware generative model. Specifically, our region-aware volume rendering strategy and global consistency constraint further enhance detail correspondences during sketch encoding. Moreover, to make the system accessible to novice users, we propose a Contour-to-Sketch module with vector-quantized representations, so that easily drawn contours can directly guide the generation of 3D portraits. Extensive comparisons show that our method generates high-quality results that match the sketch, and our usability study verifies that users prefer our system.
"Make Your Brief Stroke Real and Stereoscopic: 3D-Aware Simplified Sketch to Portrait Generation." https://doi.org/10.1145/3577190.3614106
Apostolos Kalatzis, Saidur Rahman, Vishnunarayan Girishan Prabhu, Laura Stanley, Mike Wittie
One of the primary aims of Industry 5.0 is to refine the interaction between humans, machines, and robots by developing human-centered design solutions that enhance Human-Robot Collaboration, performance, trust, and safety. This research investigated how user interfaces based on 2-D and 3-D displays affect participants’ cognitive effort, task performance, trust, and situational awareness while performing a collaborative task with a robot. The study used a within-subjects design in which fifteen participants experienced three conditions: no interface, a display user interface, and a mixed-reality user interface that provided vision assistance. Participants performed a pick-and-place task with a robot in each condition under two levels of cognitive workload (high and low). Cognitive workload was measured using subjective (NASA TLX) and objective (heart rate variability) measures. Additionally, task performance, situational awareness, and trust were measured to understand the impact of the different user interfaces during a Human-Robot Collaboration task. Findings indicated that cognitive workload and user interface impacted task performance, with a significant decrease in efficiency and accuracy observed while using the mixed-reality interface. Additionally, irrespective of condition, all participants perceived the task as more cognitively demanding during the high-cognitive-workload session; however, no significant differences across the interfaces were observed. Finally, cognitive workload impacted situational awareness and trust: lower levels were reported in the high-cognitive-workload session, and the lowest levels were observed under the mixed-reality user interface condition.
{"title":"A Multimodal Approach to Investigate the Role of Cognitive Workload and User Interfaces in Human-robot Collaboration","authors":"Apostolos Kalatzis, Saidur Rahman, Vishnunarayan Girishan Prabhu, Laura Stanley, Mike Wittie","doi":"10.1145/3577190.3614112","DOIUrl":"https://doi.org/10.1145/3577190.3614112","url":null,"abstract":"One of the primary aims of Industry 5.0 is to refine the interaction between humans, machines, and robots by developing human-centered design solutions to enhance Human-Robot Collaboration, performance, trust, and safety. This research investigated how deploying a user interface utilizing a 2-D and 3-D display affects participants’ cognitive effort, task performance, trust, and situational awareness while performing a collaborative task using a robot. The study used a within-subject design where fifteen participants were subjected to three conditions: no interface, display User Interface, and mixed reality User Interface where vision assistance was provided. Participants performed a pick-and-place task with a robot in each condition under two levels of cognitive workload (i.e., high and low). The cognitive workload was measured using subjective (i.e., NASA TLX) and objective measures (i.e., heart rate variability). Additionally, task performance, situation awareness, and trust when using these interfaces were measured to understand the impact of different user interfaces during a Human-Robot Collaboration task. Findings from this study indicated that cognitive workload and user interfaces impacted task performance, where a significant decrease in efficiency and accuracy was observed while using the mixed reality interface. Additionally, irrespective of the three conditions, all participants perceived the task as more cognitively demanding during the high cognitive workload session. However, no significant differences across the interfaces were observed. Finally, cognitive workload impacted situational awareness and trust, where lower levels were reported in the high cognitive workload session, and the lowest levels were observed under the mixed reality user interface condition.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué
This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching, and learning analytics. This work restricts itself to global features and avoids individual ones, i.e., any features that could be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video branch and an audio branch with cross-attention between the modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the model's sensitivity to facial expressions within the image in a data-driven way. Extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with the best models reaching 79.24% and 75.13% accuracy on the validation and test sets, respectively. Notably, our findings highlight that this accuracy level can be reached with privacy-compliant features using only 5 frames distributed uniformly over the video.
{"title":"Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features","authors":"Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué","doi":"10.1145/3577190.3616546","DOIUrl":"https://doi.org/10.1145/3577190.3616546","url":null,"abstract":"This paper explores privacy-compliant group-level emotion recognition \"in-the-wild\" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields including social robotics, conversational agents, e-coaching and learning analytics. This research imposes itself using only global features avoiding individual ones, i.e. all features that can be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video and an audio branches with a cross-attention between modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feed them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the sensitivity of our model on facial expression within the image in a data-driven way. The extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with 79.24% and 75.13% of accuracy respectively on validation and test set for the best models. Noticeably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly distributed on the video.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexander Cao, Jean Utke, Diego Klabjan
Often, pieces of information are received sequentially over time. When has one collected enough such pieces to classify? Trading wait time for decision certainty leads to early classification problems, which have recently gained attention as a means of adapting classification to more dynamic environments. However, results so far have been limited to unimodal sequences. In this pilot study, we extend early classification to multimodal sequences by combining existing methods. Spatial-temporal transformers trained in the supervised framework of Classifier-Induced Stopping outperform exploration-based methods. We show that our new method yields experimental AUC advantages of up to 8.7%.
"Early Classifying Multimodal Sequences." https://doi.org/10.1145/3577190.3614163
Kazi Injamamul Haque, Zerrin Yumak
This paper presents FaceXHuBERT, a text-less, speech-driven 3D facial animation generation method that generates facial cues driven by an emotional expressiveness condition. In addition, it can handle audio recorded in a variety of situations (e.g., background noise, multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate 3D facial animation. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck, and the resulting animations still have issues with accurate lip-syncing, emotional expressivity, person-specific facial cues, and generalizability. In this work, we first achieve better results than the state of the art on the speech-driven 3D facial animation generation task by effectively employing the self-supervised pretrained HuBERT speech model, which allows us to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Second, we incorporate emotional expressiveness by guiding the network with a binary emotion condition. We carried out extensive objective and subjective evaluations against ground truth and the state of the art. A perceptual user study demonstrates that facial animations generated expressively with our approach are perceived as more realistic and are preferred over non-expressive ones. In addition, we show that a strong audio encoder alone eliminates the need for a complex decoder in the network architecture, reducing network complexity and training time significantly. We provide the code publicly and recommend watching the accompanying video.
"FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning." https://doi.org/10.1145/3577190.3614157
Yubin Kim, Dong Won Lee, Paul Pu Liang, Sharifa Alghowinem, Cynthia Breazeal, Hae Won Park
Accurately modeling affect dynamics, that is, the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of intra- and inter-personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how they evolve over time. Interpersonal dependencies, on the other hand, involve the interactions and dynamics between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose a Cross-person Memory Transformer (CPM-T) framework that explicitly models intra- and inter-personal dependencies in multimodal nonverbal cues. The CPM-T framework maintains memory modules to store and update dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information across modalities and leverages cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets for joint engagement, rapport, and human belief prediction tasks. Our framework outperforms baseline models in average F1-score by up to 22.6%, 15.1%, and 10.0%, respectively, on these three tasks. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior.
{"title":"HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer","authors":"Yubin Kim, Dong Won Lee, Paul Pu Liang, Sharifa Alghowinem, Cynthia Breazeal, Hae Won Park","doi":"10.1145/3577190.3614122","DOIUrl":"https://doi.org/10.1145/3577190.3614122","url":null,"abstract":"Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of intra- and inter- personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how it evolves over time. Interpersonal dependencies, on the other hand, involve the interactions and dynamics between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose a Cross-person Memory Transformer (CPM-T) framework which explicitly models intra- and inter- personal dependencies in multi-modal non-verbal cues. The CPM-T framework maintains memory modules to store and update dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information from multi-modalities and leverage cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets for joint engagement, rapport, and human belief prediction tasks. Our framework outperforms baseline models in average F1-scores by up to 22.6%, 15.1%, and 10.0% respectively on these three tasks. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}