Mai Imamura, Ayane Tashiro, Shiro Kumano, Kazuhiro Otsuka
A framework, synergetic functional spectrum analysis (sFSA), is proposed to reveal how multimodal nonverbal behaviors such as head movements and facial expressions cooperatively perform communicative functions in conversations. We first introduce a functional spectrum to represent the functional multiplicity and ambiguity in nonverbal behaviors, e.g., a nod could imply listening, agreement, or both. More specifically, the functional spectrum is defined as the distribution of perceptual intensities of multiple functions across multiple modalities, based on multiple raters’ judgments. Next, the functional spectrum is decomposed into a small basis set, called the synergetic functional basis, which characterizes primary and distinctive multimodal functionalities and spans a synergetic functional space. Using these bases, the input spectrum is approximated as a linear combination of the bases and corresponding coefficients, which represent its coordinates in the functional space. To this end, this paper proposes semi-orthogonal nonnegative matrix factorization (SO-NMF) and discovers essential multimodal synergies in the listener’s back-channels, thinking, and positive responses, and in the speaker’s thinking and addressing. Furthermore, we propose regression models based on convolutional neural networks (CNNs) to estimate the functional space coordinates from head movements and facial action units, and confirm the potential of sFSA.
{"title":"Analyzing Synergetic Functional Spectrum from Head Movements and Facial Expressions in Conversations","authors":"Mai Imamura, Ayane Tashiro, Shiro Kumano, Kazuhiro Otsuka","doi":"10.1145/3577190.3614153","DOIUrl":"https://doi.org/10.1145/3577190.3614153","url":null,"abstract":"A framework, synergetic functional spectrum analysis (sFSA), is proposed to reveal how multimodal nonverbal behaviors such as head movements and facial expressions cooperatively perform communicative functions in conversations. We first introduce a functional spectrum to represent the functional multiplicity and ambiguity in nonverbal behaviors, e.g., a nod could imply listening, agreement, or both. More specifically, the functional spectrum is defined as the distribution of perceptual intensities of multiple functions across multiple modalities, which are based on multiple raters’ judgments. Next, the functional spectrum is decomposed into a small basis set called the synergetic functional basis, which can characterize primary and distinctive multimodal functionalities and span a synergetic functional space. Using these bases, the input spectrum is approximated as a linear combination of the bases and corresponding coefficients, which represent the coordinate in the functional space. To that purpose, this paper proposes semi-orthogonal nonnegative matrix factorization (SO-NMF) and discovers some essential multimodal synergies in the listener’s back-channel, thinking, positive responses, and speaker’s thinking and addressing. Furthermore, we proposes regression models based on convolutional neural networks (CNNs) to estimate the functional space coordinates from head movements and facial action units, and confirm the potential of the sFSA.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ayaka Ideno, Takuhiro Kaneko, Tatsuya Harada
Understanding an avatar’s motion and controlling its content is important for content creation and has been actively studied in computer vision and graphics. An avatar’s motion consists of frames, each representing a pose at a point in time, and a subsequence of frames can be grouped into a segment based on its semantic meaning. To enable semantic-level control of motion, it is important to understand the semantic division of the avatar’s motion. We define a semantic division of the avatar’s motion as an “event”, which switches only when a frame in the motion cannot be predicted from the previous frames and the information of the last event, and we tackle editing motion and inferring motion from text based on events. This is challenging because we must both obtain the event information and control the content of the motion based on it. To overcome this challenge, we propose obtaining a frame-level event representation from paired motion and text and using it to edit events in motion and to predict motion from text. Specifically, we learn a frame-level event representation by reconstructing the avatar’s motion from the corresponding frame-level event representation sequence while inferring that sequence from the text. By doing so, we can predict motion from text. Also, since the event at each motion frame is represented by the corresponding event representation, we can edit events in motion by editing the corresponding event representation sequence. We evaluated our method on the HumanML3D dataset and demonstrated that our model can generate motion from text while editing motion flexibly (e.g., allowing changes to event duration, modification of event characteristics, and the addition of new events).
"Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion." https://doi.org/10.1145/3577190.3614175
Zeyu Zhao, Nan Gao, Zhi Zeng, Guixuan Zhang, Jie Liu, Shuwu Zhang
This paper presents the CASIA-GO entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) Challenge 2023. The system is designed for few-shot scenarios, such as generating gestures in the style of an arbitrary in-the-wild target speaker from short speech samples. Given a set of reference speech data, including gesture sequences, audio, and text, it first constructs a gesture motion graph that describes the soft gesture units and the inter-frame continuity within the speech; when test audio and text are provided, new rhythmic and semantically appropriate gestures are reenacted by pathfinding over this graph. We randomly choose one clip from the training data for each test clip to simulate a few-shot scenario and provide compatible results for the subjective evaluations. Despite using, on average, only 0.25% of the whole training set per test clip and 17.5% of the training set across the entire test set, the system produces valid results and ranks in the top third for appropriateness to agent speech.
{"title":"Gesture Motion Graphs for Few-Shot Speech-Driven Gesture Reenactment","authors":"Zeyu Zhao, Nan Gao, Zhi Zeng, Guixuan Zhang, Jie Liu, Shuwu Zhang","doi":"10.1145/3577190.3616118","DOIUrl":"https://doi.org/10.1145/3577190.3616118","url":null,"abstract":"This paper presents the CASIA-GO entry to the Generation and Evaluation of Non-verbal Behaviour for Embedded Agents (GENEA) Challenge 2023. The system is originally designed for few-shot scenarios such as generating gestures with the style of any in-the-wild target speaker from short speech samples. Given a group of reference speech data including gesture sequences, audio, and text, it first constructs a gesture motion graph that describes the soft gesture units and interframe continuity inside the speech, which is ready to be used for new rhythmic and semantic gesture reenactment by pathfinding when test audio and text are provided. We randomly choose one clip from the training data for one test clip to simulate a few-shot scenario and provide compatible results for subjective evaluations. Despite the 0.25% average utilization of the whole training set for each clip in the test set and the 17.5% total utilization of the training set for the whole test set, the system succeeds in providing valid results and ranks in the top 1/3 in the appropriateness for agent speech evaluation.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, Kazushi Ikeda
This paper describes the 9th Emotion Recognition in the Wild (EmotiW) challenge, run as a grand challenge at the 25th ACM International Conference on Multimodal Interaction 2023. The EmotiW challenge focuses on affect-related benchmarking tasks and comprises two sub-challenges: a) User Engagement Prediction in the Wild, and b) Audio-Visual Group-based Emotion Recognition. The purpose of the challenge is to provide a common platform for researchers from diverse domains. The objective is to promote the development and assessment of methods that can predict engagement levels and/or identify the perceived emotional well-being of a group of individuals in real-world circumstances. We describe the datasets, the challenge protocols, and the accompanying sub-challenges.
{"title":"EmotiW 2023: Emotion Recognition in the Wild Challenge","authors":"Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, Kazushi Ikeda","doi":"10.1145/3577190.3616545","DOIUrl":"https://doi.org/10.1145/3577190.3616545","url":null,"abstract":"This paper describes the 9th Emotion Recognition in the Wild (EmotiW) challenge, which is being run as a grand challenge at the 25th ACM International Conference on Multimodal Interaction 2023. EmotiW challenge focuses on affect related benchmarking tasks and comprises of two sub-challenges: a) User Engagement Prediction in the Wild, and b) Audio-Visual Group-based Emotion Recognition. The purpose of this challenge is to provide a common platform for researchers from diverse domains. The objective is to promote the development and assessment of methods, which can predict engagement levels and/or identify perceived emotional well-being of a group of individuals in real-world circumstances. We describe the datasets, the challenge protocols and the accompanying sub-challenge.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yasheng Sun, Qianyi Wu, Hang Zhou, Kaisiyuan Wang, Tianshu Hu, Chen-Chieh Liao, Dongliang He, Jingtuo Liu, Errui Ding, Jingdong Wang, Shio Miyafuji, Ziwei Liu, Hideki Koike
Creating photo-realistic versions of people’s sketched portraits is useful for various entertainment purposes. Existing studies only generate portraits in the 2D plane with fixed views, making the results less vivid. In this paper, we present Stereoscopic Simplified Sketch-to-Portrait (SSSP), which explores the possibility of creating stereoscopic, 3D-aware portraits from simple contour sketches by involving 3D generative models. Our key insight is to design sketch-aware constraints that fully exploit the prior knowledge of a tri-plane-based 3D-aware generative model. Specifically, our region-aware volume rendering strategy and global consistency constraint further enhance detail correspondences during sketch encoding. Moreover, to make the system accessible to novice users, we propose a Contour-to-Sketch module with vector-quantized representations, so that easily drawn contours can directly guide the generation of 3D portraits. Extensive comparisons show that our method generates high-quality results that match the sketch, and our usability study verifies that users prefer our system.
"Make Your Brief Stroke Real and Stereoscopic: 3D-Aware Simplified Sketch to Portrait Generation." https://doi.org/10.1145/3577190.3614106
Apostolos Kalatzis, Saidur Rahman, Vishnunarayan Girishan Prabhu, Laura Stanley, Mike Wittie
One of the primary aims of Industry 5.0 is to refine the interaction between humans, machines, and robots by developing human-centered design solutions that enhance Human-Robot Collaboration, performance, trust, and safety. This research investigated how user interfaces based on 2-D and 3-D displays affect participants’ cognitive effort, task performance, trust, and situational awareness while performing a collaborative task with a robot. The study used a within-subjects design in which fifteen participants experienced three conditions: no interface, a display user interface, and a mixed-reality user interface that provided vision assistance. Participants performed a pick-and-place task with a robot in each condition under two levels of cognitive workload (high and low). Cognitive workload was measured using subjective (NASA TLX) and objective (heart rate variability) measures. Additionally, task performance, situational awareness, and trust were measured to understand the impact of the different user interfaces during a Human-Robot Collaboration task. Findings indicated that cognitive workload and user interface impacted task performance, with a significant decrease in efficiency and accuracy observed while using the mixed-reality interface. Additionally, irrespective of condition, all participants perceived the task as more cognitively demanding during the high-cognitive-workload session; however, no significant differences across the interfaces were observed. Finally, cognitive workload impacted situational awareness and trust: lower levels were reported in the high-cognitive-workload session, and the lowest levels were observed under the mixed-reality user interface condition.
{"title":"A Multimodal Approach to Investigate the Role of Cognitive Workload and User Interfaces in Human-robot Collaboration","authors":"Apostolos Kalatzis, Saidur Rahman, Vishnunarayan Girishan Prabhu, Laura Stanley, Mike Wittie","doi":"10.1145/3577190.3614112","DOIUrl":"https://doi.org/10.1145/3577190.3614112","url":null,"abstract":"One of the primary aims of Industry 5.0 is to refine the interaction between humans, machines, and robots by developing human-centered design solutions to enhance Human-Robot Collaboration, performance, trust, and safety. This research investigated how deploying a user interface utilizing a 2-D and 3-D display affects participants’ cognitive effort, task performance, trust, and situational awareness while performing a collaborative task using a robot. The study used a within-subject design where fifteen participants were subjected to three conditions: no interface, display User Interface, and mixed reality User Interface where vision assistance was provided. Participants performed a pick-and-place task with a robot in each condition under two levels of cognitive workload (i.e., high and low). The cognitive workload was measured using subjective (i.e., NASA TLX) and objective measures (i.e., heart rate variability). Additionally, task performance, situation awareness, and trust when using these interfaces were measured to understand the impact of different user interfaces during a Human-Robot Collaboration task. Findings from this study indicated that cognitive workload and user interfaces impacted task performance, where a significant decrease in efficiency and accuracy was observed while using the mixed reality interface. Additionally, irrespective of the three conditions, all participants perceived the task as more cognitively demanding during the high cognitive workload session. However, no significant differences across the interfaces were observed. Finally, cognitive workload impacted situational awareness and trust, where lower levels were reported in the high cognitive workload session, and the lowest levels were observed under the mixed reality user interface condition.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué
This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching, and learning analytics. This work restricts itself to global features and avoids individual ones, i.e., any features that could be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video branch and an audio branch with cross-attention between the modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the model's sensitivity to facial expressions within the image in a data-driven way. Extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with the best models reaching 79.24% and 75.13% accuracy on the validation and test sets, respectively. Notably, our findings highlight that this accuracy level can be reached with privacy-compliant features using only 5 frames distributed uniformly over the video.
{"title":"Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features","authors":"Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué","doi":"10.1145/3577190.3616546","DOIUrl":"https://doi.org/10.1145/3577190.3616546","url":null,"abstract":"This paper explores privacy-compliant group-level emotion recognition \"in-the-wild\" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields including social robotics, conversational agents, e-coaching and learning analytics. This research imposes itself using only global features avoiding individual ones, i.e. all features that can be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video and an audio branches with a cross-attention between modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feed them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the sensitivity of our model on facial expression within the image in a data-driven way. The extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with 79.24% and 75.13% of accuracy respectively on validation and test set for the best models. Noticeably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly distributed on the video.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexander Cao, Jean Utke, Diego Klabjan
Often, pieces of information are received sequentially over time. When has one collected enough such pieces to classify? Trading wait time for decision certainty leads to early classification problems, which have recently gained attention as a means of adapting classification to more dynamic environments. However, results so far have been limited to unimodal sequences. In this pilot study, we extend early classification to multimodal sequences by combining existing methods. Spatial-temporal transformers trained in the supervised framework of Classifier-Induced Stopping outperform exploration-based methods. We show that our new method yields experimental AUC advantages of up to 8.7%.
"Early Classifying Multimodal Sequences." https://doi.org/10.1145/3577190.3614163
Kazi Injamamul Haque, Zerrin Yumak
This paper presents FaceXHuBERT, a text-less, speech-driven 3D facial animation generation method that generates facial cues driven by an emotional expressiveness condition. In addition, it can handle audio recorded in a variety of situations (e.g., background noise, multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate 3D facial animation. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck, and the resulting animations still have issues with accurate lip-syncing, emotional expressivity, person-specific facial cues, and generalizability. In this work, we first achieve better results than the state of the art on the speech-driven 3D facial animation generation task by effectively employing the self-supervised pretrained HuBERT speech model, which allows us to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Second, we incorporate emotional expressiveness by guiding the network with a binary emotion condition. We carried out extensive objective and subjective evaluations against ground truth and the state of the art. A perceptual user study demonstrates that facial animations generated expressively with our approach are perceived as more realistic and are preferred over non-expressive ones. In addition, we show that a strong audio encoder alone eliminates the need for a complex decoder in the network architecture, reducing network complexity and training time significantly. We provide the code publicly and recommend watching the accompanying video.
"FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning." https://doi.org/10.1145/3577190.3614157
Yubin Kim, Dong Won Lee, Paul Pu Liang, Sharifa Alghowinem, Cynthia Breazeal, Hae Won Park
Accurately modeling affect dynamics, that is, the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of intra- and inter-personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how they evolve over time. Interpersonal dependencies, on the other hand, involve the interactions and dynamics between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose a Cross-person Memory Transformer (CPM-T) framework that explicitly models intra- and inter-personal dependencies in multimodal nonverbal cues. The CPM-T framework maintains memory modules to store and update dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information across modalities and leverages cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets for joint engagement, rapport, and human belief prediction tasks. Our framework outperforms baseline models in average F1-score by up to 22.6%, 15.1%, and 10.0%, respectively, on these three tasks. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior.
{"title":"HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer","authors":"Yubin Kim, Dong Won Lee, Paul Pu Liang, Sharifa Alghowinem, Cynthia Breazeal, Hae Won Park","doi":"10.1145/3577190.3614122","DOIUrl":"https://doi.org/10.1145/3577190.3614122","url":null,"abstract":"Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of intra- and inter- personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how it evolves over time. Interpersonal dependencies, on the other hand, involve the interactions and dynamics between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose a Cross-person Memory Transformer (CPM-T) framework which explicitly models intra- and inter- personal dependencies in multi-modal non-verbal cues. The CPM-T framework maintains memory modules to store and update dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information from multi-modalities and leverage cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets for joint engagement, rapport, and human belief prediction tasks. Our framework outperforms baseline models in average F1-scores by up to 22.6%, 15.1%, and 10.0% respectively on these three tasks. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}