
Proceedings of the 2020 International Conference on Multimodal Interaction: Latest Publications

Mimicker-in-the-Browser: A Novel Interaction Using Mimicry to Augment the Browsing Experience
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418811
Riku Arakawa, Hiromu Yakura
Humans are known to have a better subconscious impression of other humans when their movements are imitated in social interactions. Despite this influential phenomenon, its application in human-computer interaction is currently limited to specific areas, such as an agent mimicking the head movements of a user in virtual reality, because capturing user movements conventionally requires external sensors. If we can implement the mimicry effect in a scalable platform without such sensors, a new approach for designing human-computer interaction will be introduced. Therefore, we have investigated whether users feel positively toward a mimicking agent that is delivered by a standalone web application using only a webcam. We also examined whether a web page that changes its background pattern based on head movements can foster a favorable impression. The positive effect confirmed in our experiments supports mimicry as a novel design practice to augment our daily browsing experiences.
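The mechanism underneath both study conditions is simply tracking the user's head with the webcam and feeding the (delayed) movement back into the page. The Python/OpenCV sketch below illustrates that loop; it is not the authors' in-browser implementation, and the one-second delay and the Haar-cascade face detector are placeholders chosen for illustration.

```python
# Minimal sketch (not the authors' system): estimate head movement from a webcam and
# replay it after a short delay, the basic ingredient of a mimicking agent.
from collections import deque

import cv2

DELAY_FRAMES = 30  # assumed mimicry delay (~1 s at 30 fps); the paper may use another value

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
history = deque(maxlen=DELAY_FRAMES)  # buffer of past head positions

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])          # largest detected face
        history.append((int(x + w // 2), int(y + h // 2)))          # head-position proxy
    if len(history) == DELAY_FRAMES:
        # The delayed position is what an on-screen agent (or a moving background
        # pattern) would follow in order to mimic the user's head movement.
        mimic_x, mimic_y = history[0]
        cv2.circle(frame, (mimic_x, mimic_y), 20, (0, 255, 0), 2)
    cv2.imshow("mimicker-sketch", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```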
Cited by: 0
Temporal Attention and Consistency Measuring for Video Question Answering
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418886
Lingyu Zhang, R. Radke
Social signal processing algorithms have become increasingly better at solving well-defined prediction and estimation problems in audiovisual recordings of group discussion. However, much human behavior and communication is less structured and more subtle. In this paper, we address the problem of generic question answering from diverse audiovisual recordings of human interaction. The goal is to select the correct free-text answer to a free-text question about human interaction in a video. We propose an RNN-based model with two novel ideas: a temporal attention module that highlights key words and phrases in the question and candidate answers, and a consistency measurement module that scores the similarity between the multimodal data, the question, and the candidate answers. This small set of consistency scores forms the input to the final question-answering stage, resulting in a lightweight model. We demonstrate that our model achieves state of the art accuracy on the Social-IQ dataset containing hundreds of videos and question/answer pairs.
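A minimal PyTorch sketch of the two components named above is given below; it is not the authors' code. The GRU text encoder, the embedding sizes, and cosine similarity as the consistency measure are assumptions made for illustration, but the structure follows the description: temporal attention weights the tokens of the question and each candidate answer, and a small set of consistency scores feeds a lightweight final answer-selection stage.

```python
# Hedged sketch of a temporal-attention text encoder plus a consistency-score
# answering stage. Dimensions and the choice of cosine similarity are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveTextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.att = nn.Linear(dim, 1)  # scores each time step (word/phrase)

    def forward(self, tokens):                      # tokens: (B, T)
        h, _ = self.gru(self.embed(tokens))         # (B, T, D)
        w = F.softmax(self.att(h), dim=1)           # temporal attention weights
        return (w * h).sum(dim=1)                   # attended summary (B, D)


class ConsistencyQA(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.text_enc = AttentiveTextEncoder(dim=dim)
        self.scorer = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, video_feat, question, answers):
        # video_feat: (B, D) pooled multimodal features; answers: list of (B, T) tensors
        q = self.text_enc(question)
        logits = []
        for ans in answers:
            a = self.text_enc(ans)
            # small set of consistency scores fed to the final answering stage
            c_va = F.cosine_similarity(video_feat, a, dim=-1)   # answer vs. multimodal data
            c_qa = F.cosine_similarity(q, a, dim=-1)            # answer vs. question
            logits.append(self.scorer(torch.stack([c_va, c_qa], dim=-1)))
        return torch.cat(logits, dim=-1)  # (B, n_answers); argmax selects the answer


model = ConsistencyQA()
vid = torch.randn(2, 256)
q = torch.randint(0, 10000, (2, 12))
ans = [torch.randint(0, 10000, (2, 8)) for _ in range(4)]
print(model(vid, q, ans).shape)  # torch.Size([2, 4])
```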
Cited by: 3
Human-centered Multimodal Machine Intelligence
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3417974
Shrikanth S. Narayanan
Multimodal machine intelligence offers enormous possibilities for helping understand the human condition and in creating technologies to support and enhance human experiences [1, 2]. What makes such approaches and systems exciting is the promise they hold for adaptation and personalization in the presence of the rich and vast inherent heterogeneity, variety and diversity within and across people. Multimodal engineering approaches can help analyze human trait (e.g., age), state (e.g., emotion), and behavior dynamics (e.g., interaction synchrony) objectively, and at scale. Machine intelligence could also help detect and analyze deviation in patterns from what is deemed typical. These techniques in turn can assist, facilitate or enhance decision making by humans, and by autonomous systems. Realizing such a promise requires addressing two major lines of, oft intertwined, challenges: creating inclusive technologies that work for everyone while enabling tools that can illuminate the source of variability or difference of interest. This talk will highlight some of these possibilities and opportunities through examples drawn from two specific domains. The first relates to advancing health informatics in behavioral and mental health [3, 4]. With over 10% of the world's population affected, and with clinical research and practice heavily dependent on (relatively scarce) human expertise in diagnosing, managing and treating the condition, engineering opportunities in offering access and tools to support care at scale are immense. For example, in determining whether a child is on the Autism spectrum, a clinician would engage and observe a child in a series of interactive activities, targeting relevant cognitive, communicative and socio-emotional aspects, and codify specific patterns of interest e.g., typicality of vocal intonation, facial expressions, joint attention behavior. Machine intelligence driven processing of speech, language, visual and physiological data, and combining them with other forms of clinical data, enable novel and objective ways of supporting and scaling up these diagnostics. Likewise, multimodal systems can automate the analysis of a psychotherapy session, including computing treatment quality-assurance measures e.g., rating a therapist's expressed empathy. These technology possibilities can go beyond the traditional realm of clinics, directly to patients in their natural settings. For example, remote multimodal sensing of biobehavioral cues can enable new ways for screening and tracking behaviors (e.g., stress in workplace) and progress to treatment (e.g., for depression), and offer just in time support. The second example is drawn from the world of media. Media are created by humans and for humans to tell stories. They cover an amazing range of domains, from the arts and entertainment to news, education and commerce, and in staggering volume. Machine intelligence tools can help analyze media and measure their impact on individuals and society. This includes offering objective insights into diversity and inclusion in media representations by robustly characterizing media portrayals, including from an intersectional perspective, along relevant dimensions of inclusion (gender, race, age, ability and other attributes), and creating tools to support change [5, 6]. This again underscores the dual technology requirement: performing equally well in characterizing individuals regardless of the dimension of variability, and using these inclusive technologies to illuminate and create tools that support diversity and inclusion.
Cited by: 0
Effect of Modality on Human and Machine Scoring of Presentation Videos
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418880
Haley Lepp, C. W. Leong, K. Roohr, Michelle P. Martín‐Raugh, Vikram Ramanarayanan
We investigate the effect of observed data modality on human and machine scoring of informative presentations in the context of oral English communication training and assessment. Three sets of raters scored the content of three minute presentations by college students on the basis of either the video, the audio or the text transcript using a custom scoring rubric. We find significant differences between the scores assigned when raters view a transcript or listen to audio recordings in comparison to watching a video of the same presentation, and present an analysis of those differences. Using the human scores, we train machine learning models to score a given presentation using text, audio, and video features separately. We analyze the distribution of machine scores against the modality and label bias we observe in human scores, discuss its implications for machine scoring and recommend best practices for future work in this direction. Our results demonstrate the importance of checking and correcting for bias across different modalities in evaluations of multi-modal performances.
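As a rough illustration of the modeling setup described above (separate machine scorers per modality compared against human ratings), the hedged sketch below uses placeholder features and a simple Ridge regressor; the paper's actual features, model family, and scoring rubric are not reproduced here.

```python
# Illustrative sketch only: train one scoring model per modality (text, audio, video),
# then compare each model's predicted scores against the human ratings to look for
# modality-dependent shifts. Features and scores here are random placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 200                                   # number of presentations (placeholder)
human_scores = rng.uniform(1, 5, size=n)  # rubric-based human ratings (placeholder)
features = {
    "text": rng.normal(size=(n, 50)),     # e.g., transcript-derived features
    "audio": rng.normal(size=(n, 30)),    # e.g., prosodic features
    "video": rng.normal(size=(n, 40)),    # e.g., visual/gestural features
}

for modality, X in features.items():
    preds = cross_val_predict(Ridge(alpha=1.0), X, human_scores, cv=5)
    shift = preds.mean() - human_scores.mean()   # systematic offset vs. human scores
    corr = np.corrcoef(preds, human_scores)[0, 1]
    print(f"{modality:5s}  mean shift vs. human = {shift:+.2f}   r = {corr:.2f}")
```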
Cited by: 2
Conventional and Non-conventional Job Interviewing Methods: A Comparative Study in Two Countries
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418824
K. Shubham, E. Kleinlogel, Anaïs Butera, M. S. Mast, D. Jayagopi
With recent advancements in technology, new platforms have come up to substitute face-to-face interviews. Of particular interest are asynchronous video interviewing (AVI) platforms, where candidates talk to a screen with questions, and virtual agent based interviewing platforms, where a human-like avatar interviews candidates. These anytime-anywhere interviewing systems scale up the overall reach of the interviewing process for firms, though they may not provide the best experience for the candidates. An important research question is how the candidates perceive such platforms and its impact on their performance and behavior. Also, is there an advantage of one setting vs. another i.e., Avatar vs. Platform? Finally, would such differences be consistent across cultures? In this paper, we present the results of a comparative study conducted in three different interview settings (i.e., Face-to-face, Avatar, and Platform), as well as two different cultural contexts (i.e., India and Switzerland), and analyze the differences in self-rated, others-rated performance, and automatic audiovisual behavioral cues.
Cited by: 5
X-AWARE: ConteXt-AWARE Human-Environment Attention Fusion for Driver Gaze Prediction in the Wild
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3417967
Lukas Stappen, Georgios Rizos, Björn Schuller
Reliable systems for automatic estimation of the driver's gaze are crucial for reducing the number of traffic fatalities and for many emerging research areas aimed at developing intelligent vehicle-passenger systems. Gaze estimation is a challenging task, especially in environments with varying illumination and reflection properties. Furthermore, there is wide diversity with respect to the appearance of drivers' faces, both in terms of occlusions (e.g. vision aids) and cultural/ethnic backgrounds. For this reason, analysing the face along with contextual information - for example, the vehicle cabin environment - adds another, less subjective signal towards the design of robust systems for passenger gaze estimation. In this paper, we present an integrated approach to jointly model different features for this task. In particular, to improve the fusion of the visually captured environment with the driver's face, we have developed a contextual attention mechanism, X-AWARE, attached directly to the output convolutional layers of InceptionResNetV2 networks. In order to showcase the effectiveness of our approach, we use the Driver Gaze in the Wild dataset, recently released as part of the Eighth Emotion Recognition in the Wild Challenge (EmotiW) challenge. Our best model outperforms the baseline by an absolute of 15.03% in accuracy on the validation set, and improves the previously best reported result by an absolute of 8.72% on the test set.
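The sketch below illustrates the general idea of face-conditioned attention over an environment feature map; it is not the released X-AWARE code. Small convolutional stacks stand in for the InceptionResNetV2 backbones, and the image sizes and nine-zone output are assumptions made for illustration.

```python
# Hedged sketch of context-aware attention fusion in the spirit described above:
# pooled face features attend over the spatial feature map of the cabin/environment
# image, and the fused representation predicts the gaze zone.
import torch
import torch.nn as nn
import torch.nn.functional as F


def tiny_backbone(out_ch=128):
    # Placeholder for an InceptionResNetV2 feature extractor.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, out_ch, 3, stride=2, padding=1), nn.ReLU(),
    )


class ContextAttentionFusion(nn.Module):
    def __init__(self, ch=128, n_gaze_zones=9):
        super().__init__()
        self.face_net = tiny_backbone(ch)
        self.env_net = tiny_backbone(ch)
        self.query = nn.Linear(ch, ch)
        self.head = nn.Linear(2 * ch, n_gaze_zones)

    def forward(self, face_img, env_img):             # both: (B, 3, H, W)
        f = self.face_net(face_img).mean(dim=(2, 3))  # pooled face vector (B, C)
        e = self.env_net(env_img)                     # environment feature map (B, C, h, w)
        B, C, h, w = e.shape
        e_flat = e.flatten(2).transpose(1, 2)         # (B, h*w, C)
        # Attention: a face-conditioned query scores each spatial location of the cabin.
        scores = torch.bmm(e_flat, self.query(f).unsqueeze(-1)).squeeze(-1)  # (B, h*w)
        att = F.softmax(scores / C ** 0.5, dim=-1)
        e_ctx = torch.bmm(att.unsqueeze(1), e_flat).squeeze(1)               # (B, C)
        return self.head(torch.cat([f, e_ctx], dim=-1))                      # gaze logits


logits = ContextAttentionFusion()(torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128))
print(logits.shape)  # torch.Size([2, 9])
```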
Cited by: 13
EmotiW 2020: Driver Gaze, Group Emotion, Student Engagement and Physiological Signal based Challenges
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3417973
Abhinav Dhall, Garima Sharma, R. Goecke, Tom Gedeon
This paper introduces the Eighth Emotion Recognition in the Wild (EmotiW) challenge. EmotiW is a benchmarking effort run as a grand challenge of the 22nd ACM International Conference on Multimodal Interaction 2020. It comprises of four tasks related to automatic human behavior analysis: a) driver gaze prediction; b) audio-visual group-level emotion recognition; c) engagement prediction in the wild; and d) physiological signal based emotion recognition. The motivation of EmotiW is to bring researchers in affective computing, computer vision, speech processing and machine learning to a common platform for evaluating techniques on a test data. We discuss the challenge protocols, databases and their associated baselines.
Cited by: 65
MSP-Face Corpus: A Natural Audiovisual Emotional Database
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418872
Andrea Vidal, Ali N. Salman, Wei-Cheng Lin, C. Busso
Expressive behaviors conveyed during daily interactions are difficult to determine, because they often consist of a blend of different emotions. The complexity in expressive human communication is an important challenge to build and evaluate automatic systems that can reliably predict emotions. Emotion recognition systems are often trained with limited databases, where the emotions are either elicited or recorded by actors. These approaches do not necessarily reflect real emotions, creating a mismatch when the same emotion recognition systems are applied to practical applications. Developing rich emotional databases that reflect the complexity in the externalization of emotion is an important step to build better models to recognize emotions. This study presents the MSP-Face database, a natural audiovisual database obtained from video-sharing websites, where multiple individuals discuss various topics expressing their opinions and experiences. The natural recordings convey a broad range of emotions that are difficult to obtain with other alternative data collection protocols. A feature of the corpus is the addition of two sets. The first set includes videos that have been annotated with emotional labels using a crowd-sourcing protocol (9,370 recordings -- 24 hrs, 41 m). The second set includes similar videos without emotional labels (17,955 recordings -- 45 hrs, 57 m), offering the perfect infrastructure to explore semi-supervised and unsupervised machine-learning algorithms on natural emotional videos. This study describes the process of collecting and annotating the corpus. It also provides baselines over this new database using unimodal (audio, video) and multimodal emotional recognition systems.
Cited by: 9
Towards a Multimodal and Context-Aware Framework for Human Navigational Intent Inference
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3421156
Z. Zhang
A socially acceptable robot needs to make correct decisions and be able to understand human intent in order to interact with and navigate around humans safely. Although research in computer vision and robotics has made huge advances in recent years, today's robotics systems still need a better understanding of human intent to be more effective and widely accepted. Currently such inference is typically done using only one mode of perception, such as vision or human movement trajectory. In this extended abstract, I describe my PhD research plan of developing a novel multimodal and context-aware framework, in which a robot infers human navigational intentions through multimodal perception comprised of human temporal facial, body pose and gaze features, human motion features, as well as environmental context. To facilitate this framework, a data collection experiment is designed to acquire multimodal human-robot interaction data. Our initial design of the framework is based on a temporal neural network model with human motion, body pose and head orientation features as input; a minimal sketch of such a design is shown below. We will increase the complexity of the neural network model as well as the input features along the way. In the long term, this framework can benefit a variety of settings such as autonomous driving, service and household robots.
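The sketch assumes per-frame motion, body-pose, and head-orientation feature vectors and a small set of intent classes; these dimensions and labels are illustrative, not the author's specification.

```python
# Minimal sketch of a temporal model over human motion, body pose, and head
# orientation features for navigational intent inference. Sizes are assumptions.
import torch
import torch.nn as nn

MOTION_DIM, POSE_DIM, HEAD_DIM = 4, 34, 3   # e.g., velocity, 2D keypoints, yaw/pitch/roll


class IntentLSTM(nn.Module):
    def __init__(self, hidden=64, n_intents=3):   # e.g., pass-by, approach, avoid (assumed)
        super().__init__()
        self.lstm = nn.LSTM(MOTION_DIM + POSE_DIM + HEAD_DIM, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_intents)

    def forward(self, motion, pose, head):          # each: (B, T, dim)
        x = torch.cat([motion, pose, head], dim=-1) # fuse the per-frame features
        _, (h_n, _) = self.lstm(x)                  # last hidden state summarizes the track
        return self.cls(h_n[-1])                    # intent logits (B, n_intents)


B, T = 8, 30  # 30 observed frames per person track (assumed)
model = IntentLSTM()
logits = model(torch.randn(B, T, MOTION_DIM), torch.randn(B, T, POSE_DIM), torch.randn(B, T, HEAD_DIM))
print(logits.shape)  # torch.Size([8, 3])
```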
Cited by: 2
A Comparison between Laboratory and Wearable Sensors in the Context of Physiological Synchrony
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418837
Jasper J. van Beers, I. Stuldreher, Nattapong Thammasan, A. Brouwer
Measuring concurrent changes in autonomic physiological responses aggregated across individuals (Physiological Synchrony - PS) can provide insight into group-level cognitive or emotional processes. Utilizing cheap and easy-to-use wearable sensors to measure physiology rather than their high-end laboratory counterparts is desirable. Since it is currently ambiguous how different signal properties (arising from different types of measuring equipment) influence the detection of PS associated with mental processes, it is unclear whether, or to what extent, PS based on data from wearables compares to that from their laboratory equivalents. Existing literature has investigated PS using both types of equipment, but none compared them directly. In this study, we measure PS in electrodermal activity (EDA) and inter-beat interval (IBI, inverse of heart rate) of participants who listened to the same audio stream but were either instructed to attend to the presented narrative (n=13) or to the interspersed auditory events (n=13). Both laboratory and wearable sensors were used (ActiveTwo electrocardiogram (ECG) and EDA; Wahoo Tickr and EdaMove4). A participant's attentional condition was classified based on which attentional group they shared greater synchrony with. For both types of sensors, we found classification accuracies of 73% or higher in both EDA and IBI. We found no significant difference in classification accuracies between the laboratory and wearable sensors. These findings encourage the use of wearables for PS based research and for in-the-field measurements.
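The classification rule described above can be illustrated with a short hedged sketch: each held-out participant is assigned to whichever group's aggregate signal their own signal is more synchronous with, leaving the participant out of their own group's aggregate. The data below are random placeholders, and correlation against the group-mean signal is one simple way to operationalize synchrony; the authors' exact synchrony measure may differ.

```python
# Hedged sketch of group-based attention classification from physiological synchrony.
import numpy as np

rng = np.random.default_rng(1)
n_per_group, n_samples = 13, 6000          # e.g., EDA or IBI resampled to a common rate
narrative = rng.normal(size=(n_per_group, n_samples))   # group attending to the narrative
events = rng.normal(size=(n_per_group, n_samples))      # group attending to auditory events


def classify(signal, own_group, other_group, own_idx):
    own_ref = np.delete(own_group, own_idx, axis=0).mean(axis=0)  # leave-one-out group mean
    other_ref = other_group.mean(axis=0)
    sync_own = np.corrcoef(signal, own_ref)[0, 1]
    sync_other = np.corrcoef(signal, other_ref)[0, 1]
    return "own" if sync_own > sync_other else "other"


correct = sum(
    classify(narrative[i], narrative, events, i) == "own" for i in range(n_per_group)
) + sum(classify(events[i], events, narrative, i) == "own" for i in range(n_per_group))
print(f"accuracy: {correct / (2 * n_per_group):.2f}")   # ~chance here, since data are random
```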
Cited by: 10