
Proceedings of the 2020 International Conference on Multimodal Interaction: Latest Publications

Mimicker-in-the-Browser: A Novel Interaction Using Mimicry to Augment the Browsing Experience
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418811
Riku Arakawa, Hiromu Yakura
Humans are known to have a better subconscious impression of other humans when their movements are imitated in social interactions. Despite this influential phenomenon, its application in human-computer interaction is currently limited to specific areas, such as an agent mimicking the head movements of a user in virtual reality, because capturing user movements conventionally requires external sensors. If we can implement the mimicry effect in a scalable platform without such sensors, a new approach for designing human-computer interaction will be introduced. Therefore, we have investigated whether users feel positively toward a mimicking agent that is delivered by a standalone web application using only a webcam. We also examined whether a web page that changes its background pattern based on head movements can foster a favorable impression. The positive effect confirmed in our experiments supports mimicry as a novel design practice to augment our daily browsing experiences.
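A minimal sketch of the core idea, assuming a local webcam and the opencv-python package: track the user's head position and feed it back with a short delay, the way the paper's web page shifts its background to mimic head movement. It uses OpenCV's bundled Haar cascade rather than the authors' in-browser implementation; the window name and the 15-frame delay are illustrative choices, not values from the paper.

```python
# Sketch: webcam head-movement tracking with a delayed "mimicry" response.
# Uses OpenCV's bundled Haar cascade; the browser/WebRTC parts of the paper are not reproduced.
import collections
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
history = collections.deque(maxlen=15)  # ~0.5 s buffer at 30 fps for the mimicry delay

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        history.append((x + w // 2, y + h // 2))  # track the face centre
    if len(history) == history.maxlen:
        # Mirror how far the head has moved over the buffered window, e.g. to offset
        # a background pattern the way the paper's web page shifts its backdrop.
        dx = history[-1][0] - history[0][0]
        dy = history[-1][1] - history[0][1]
        cv2.circle(frame, (frame.shape[1] // 2 + dx, frame.shape[0] // 2 + dy),
                   20, (0, 255, 0), -1)
    cv2.imshow("mimicker-sketch", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```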
Citations: 0
Human-centered Multimodal Machine Intelligence
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3417974
Shrikanth S. Narayanan
Multimodal machine intelligence offers enormous possibilities for helping understand the human condition and in creating technologies to support and enhance human experiences [1, 2]. What makes such approaches and systems exciting is the promise they hold for adaptation and personalization in the presence of the rich and vast inherent heterogeneity, variety and diversity within and across people. Multimodal engineering approaches can help analyze human trait (e.g., age), state (e.g., emotion), and behavior dynamics (e.g., interaction synchrony) objectively, and at scale. Machine intelligence could also help detect and analyze deviation in patterns from what is deemed typical. These techniques in turn can assist, facilitate or enhance decision making by humans, and by autonomous systems. Realizing such a promise requires addressing two major lines of, oft intertwined, challenges: creating inclusive technologies that work for everyone while enabling tools that can illuminate the source of variability or difference of interest. This talk will highlight some of these possibilities and opportunities through examples drawn from two specific domains. The first relates to advancing health informatics in behavioral and mental health [3, 4]. With over 10% of the world's population affected, and with clinical research and practice heavily dependent on (relatively scarce) human expertise in diagnosing, managing and treating the condition, engineering opportunities in offering access and tools to support care at scale are immense. For example, in determining whether a child is on the Autism spectrum, a clinician would engage and observe a child in a series of interactive activities, targeting relevant cognitive, communicative and socio-emotional aspects, and codify specific patterns of interest, e.g., typicality of vocal intonation, facial expressions, joint attention behavior. Machine intelligence driven processing of speech, language, visual and physiological data, and combining them with other forms of clinical data, enable novel and objective ways of supporting and scaling up these diagnostics. Likewise, multimodal systems can automate the analysis of a psychotherapy session, including computing treatment quality-assurance measures, e.g., rating a therapist's expressed empathy. These technology possibilities can go beyond the traditional realm of clinics, directly to patients in their natural settings. For example, remote multimodal sensing of biobehavioral cues can enable new ways for screening and tracking behaviors (e.g., stress in the workplace) and progress to treatment (e.g., for depression), and offer just-in-time support. The second example is drawn from the world of media. Media are created by humans and for humans to tell stories. They cover an amazing range of domains, from the arts and entertainment to news, education and commerce, and in staggering volume. Machine intelligence tools can help analyze media and measure their impact on individuals and society. This includes providing objective insights into diversity and inclusion in media representation by robustly characterizing media portrayals along relevant dimensions of inclusion (gender, race, age, ability and other attributes) from an intersectional perspective, and by creating tools to support change [5, 6]. This again underscores the dual technology requirement: performing equally well in characterizing individuals regardless of the dimension of variability, and using these inclusive technologies to illuminate and build tools that support diversity and inclusion.
Citations: 0
Effect of Modality on Human and Machine Scoring of Presentation Videos
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418880
Haley Lepp, C. W. Leong, K. Roohr, Michelle P. Martín‐Raugh, Vikram Ramanarayanan
We investigate the effect of observed data modality on human and machine scoring of informative presentations in the context of oral English communication training and assessment. Three sets of raters scored the content of three-minute presentations by college students on the basis of either the video, the audio or the text transcript using a custom scoring rubric. We find significant differences between the scores assigned when raters view a transcript or listen to audio recordings in comparison to watching a video of the same presentation, and present an analysis of those differences. Using the human scores, we train machine learning models to score a given presentation using text, audio, and video features separately. We analyze the distribution of machine scores against the modality and label bias we observe in human scores, discuss its implications for machine scoring and recommend best practices for future work in this direction. Our results demonstrate the importance of checking and correcting for bias across different modalities in evaluations of multi-modal performances.
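A hedged sketch of the per-modality scoring setup described above: train one regressor per modality on the human rubric scores and compare the resulting score distributions. The feature matrices, dimensionalities, and model choice below are placeholders, not the paper's actual features or models.

```python
# Sketch: train one scoring model per modality on human ratings and compare the
# distribution of machine scores across modalities. Feature matrices are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 200                                   # hypothetical number of rated presentations
human_scores = rng.uniform(1, 5, size=n)  # rubric scores assigned by the raters

modality_features = {                     # stand-ins for real text/audio/video features
    "text":  rng.normal(size=(n, 300)),   # e.g. transcript embeddings
    "audio": rng.normal(size=(n, 88)),    # e.g. prosodic descriptors
    "video": rng.normal(size=(n, 128)),   # e.g. pose/facial features
}

for name, X in modality_features.items():
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    preds = cross_val_predict(model, X, human_scores, cv=5)
    mae = np.mean(np.abs(preds - human_scores))
    # Comparing mean and spread across modalities surfaces modality-specific bias.
    print(f"{name:5s}  MAE={mae:.2f}  mean={preds.mean():.2f}  std={preds.std():.2f}")
```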
Citations: 2
Predicting Video Affect via Induced Affection in the Wild
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418838
Yi Ding, Radha Kumaran, Tianjiao Yang, Tobias Höllerer
Curating large and high-quality datasets for studying affect is a costly and time-consuming process, especially when the labels are continuous. In this paper, we examine the potential to use unlabeled public reactions in the form of textual comments to aid in classifying video affect. We examine two popular datasets used for affect recognition and mine public reactions for these videos. We learn a representation of these reactions by using the video ratings as a weakly supervised signal. We show that our model can learn a fine-grained prediction of comment affect when given a video alone. Furthermore, we demonstrate how predicting the affective properties of a comment can be a potentially useful modality to use in multimodal affect modeling.
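A minimal sketch of the weak-supervision idea, assuming each comment simply inherits its video's affect rating as a training label; the TF-IDF plus ridge regressor below is a stand-in for the paper's learned representation, and the example data is invented.

```python
# Sketch: treat each video's affect rating as a weak label for all of its comments,
# then fit a text regressor that predicts comment affect. Data below is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

videos = {
    "vid_a": {"rating": 0.9, "comments": ["this made my day", "so wholesome"]},
    "vid_b": {"rating": -0.7, "comments": ["genuinely upsetting", "hard to watch"]},
}

texts, weak_labels = [], []
for video in videos.values():
    for comment in video["comments"]:
        texts.append(comment)
        weak_labels.append(video["rating"])   # weak supervision: inherit the video label

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
model = Ridge(alpha=1.0).fit(X, weak_labels)

# The fitted model scores unseen comments; aggregating comment scores per video is
# one way to feed "induced affection" into a multimodal affect model.
print(model.predict(vectorizer.transform(["this was lovely"])))
```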
Citations: 0
Temporal Attention and Consistency Measuring for Video Question Answering
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418886
Lingyu Zhang, R. Radke
Social signal processing algorithms have become increasingly better at solving well-defined prediction and estimation problems in audiovisual recordings of group discussion. However, much human behavior and communication is less structured and more subtle. In this paper, we address the problem of generic question answering from diverse audiovisual recordings of human interaction. The goal is to select the correct free-text answer to a free-text question about human interaction in a video. We propose an RNN-based model with two novel ideas: a temporal attention module that highlights key words and phrases in the question and candidate answers, and a consistency measurement module that scores the similarity between the multimodal data, the question, and the candidate answers. This small set of consistency scores forms the input to the final question-answering stage, resulting in a lightweight model. We demonstrate that our model achieves state of the art accuracy on the Social-IQ dataset containing hundreds of videos and question/answer pairs.
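A hedged PyTorch sketch of the two ideas named above: attention weights over question-token encodings, and a small set of cosine "consistency" scores between the video, question, and candidate-answer encodings that feed a lightweight classifier. The encoders, dimensions, and number of consistency scores are illustrative assumptions, not the authors' architecture.

```python
# Sketch of the two ideas: (1) temporal attention over question-word encodings and
# (2) cosine "consistency" scores between video, question and candidate answers,
# fed to a tiny classifier. Encoders and sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionConsistencyQA(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.attn = nn.Linear(dim, 1)        # scores each question token
        self.classifier = nn.Linear(2, 1)    # consumes the small set of consistency scores

    def forward(self, video_vec, question_tokens, answer_vec):
        # Temporal attention: weight question tokens and pool them.
        weights = F.softmax(self.attn(question_tokens), dim=1)   # (B, T, 1)
        question_vec = (weights * question_tokens).sum(dim=1)    # (B, D)
        # Consistency: similarity between the candidate answer and each context signal.
        c_video = F.cosine_similarity(answer_vec, video_vec, dim=-1)
        c_question = F.cosine_similarity(answer_vec, question_vec, dim=-1)
        scores = torch.stack([c_video, c_question], dim=-1)      # (B, 2)
        return self.classifier(scores).squeeze(-1)               # answer-correctness logit

model = AttentionConsistencyQA()
logit = model(torch.randn(4, 128), torch.randn(4, 12, 128), torch.randn(4, 128))
print(logit.shape)  # torch.Size([4])
```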
Citations: 3
Eye-Tracking to Predict User Cognitive Abilities and Performance for User-Adaptive Narrative Visualizations
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418884
Oswald Barral, Sébastien Lallé, Grigorii Guz, A. Iranpour, C. Conati
We leverage eye-tracking data to predict user performance and levels of cognitive abilities while reading magazine-style narrative visualizations (MSNV), a widespread form of multimodal documents that combine text and visualizations. Such predictions are motivated by recent interest in devising user-adaptive MSNVs that can dynamically adapt to a user's needs. Our results provide evidence for the feasibility of real-time user modeling in MSNV, as we are the first to consider eye tracking data for predicting task comprehension and cognitive abilities while processing multimodal documents. We follow with a discussion on the implications to the design of personalized MSNVs.
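A minimal sketch, under the assumption that raw gaze samples are summarized into a few aggregate features (fixation-like counts, sampling intervals, saccade lengths) and fed to a standard classifier; the gaze data, dispersion threshold, and labels below are synthetic placeholders rather than the study's features or measures.

```python
# Sketch: summarise raw gaze samples into simple aggregate features and predict a
# binary ability/performance label. Gaze data and labels are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def gaze_features(xy, timestamps, disp_threshold=30.0):
    """Very rough dispersion-based summary of one reading session."""
    steps = np.linalg.norm(np.diff(xy, axis=0), axis=1)
    fixating = steps < disp_threshold
    return np.array([
        fixating.sum(),                     # proxy for number of fixation samples
        np.diff(timestamps).mean(),         # mean inter-sample interval
        steps[~fixating].mean() if (~fixating).any() else 0.0,  # mean saccade length
    ])

rng = np.random.default_rng(0)
sessions = [(rng.uniform(0, 1000, size=(500, 2)), np.linspace(0, 10, 500))
            for _ in range(60)]
X = np.stack([gaze_features(xy, t) for xy, t in sessions])
y = np.repeat([0, 1], 30)                  # e.g. low vs. high reading proficiency groups

print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
```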
Citations: 10
MSP-Face Corpus: A Natural Audiovisual Emotional Database
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418872
Andrea Vidal, Ali N. Salman, Wei-Cheng Lin, C. Busso
Expressive behaviors conveyed during daily interactions are difficult to determine, because they often consist of a blend of different emotions. The complexity in expressive human communication is an important challenge to build and evaluate automatic systems that can reliably predict emotions. Emotion recognition systems are often trained with limited databases, where the emotions are either elicited or recorded by actors. These approaches do not necessarily reflect real emotions, creating a mismatch when the same emotion recognition systems are applied to practical applications. Developing rich emotional databases that reflect the complexity in the externalization of emotion is an important step to build better models to recognize emotions. This study presents the MSP-Face database, a natural audiovisual database obtained from video-sharing websites, where multiple individuals discuss various topics expressing their opinions and experiences. The natural recordings convey a broad range of emotions that are difficult to obtain with other alternative data collection protocols. A feature of the corpus is the addition of two sets. The first set includes videos that have been annotated with emotional labels using a crowd-sourcing protocol (9,370 recordings -- 24 hrs, 41 min). The second set includes similar videos without emotional labels (17,955 recordings -- 45 hrs, 57 min), offering the perfect infrastructure to explore semi-supervised and unsupervised machine-learning algorithms on natural emotional videos. This study describes the process of collecting and annotating the corpus. It also provides baselines over this new database using unimodal (audio, video) and multimodal emotional recognition systems.
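One hedged sketch of how the labeled and unlabeled portions of such a corpus could be combined, using simple pseudo-labeling; the features, label space, and confidence threshold are synthetic stand-ins, and this is not the corpus's own tooling or published baseline.

```python
# Sketch: combine labeled and unlabeled recordings via pseudo-labeling.
# Features and labels below are synthetic stand-ins for real corpus features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(500, 64))               # annotated recordings
y_labeled = rng.integers(0, 5, size=500)             # e.g. five emotion classes
X_unlabeled = rng.normal(size=(2000, 64))            # un-annotated recordings

model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) > 0.9                  # keep only confident pseudo-labels
X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
y_aug = np.concatenate([y_labeled, proba[confident].argmax(axis=1)])
model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)   # retrain on the union
print(confident.sum(), "pseudo-labeled examples added")
```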
Citations: 9
Conventional and Non-conventional Job Interviewing Methods: A Comparative Study in Two Countries
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418824
K. Shubham, E. Kleinlogel, Anaïs Butera, M. S. Mast, D. Jayagopi
With recent advancements in technology, new platforms have come up to substitute face-to-face interviews. Of particular interest are asynchronous video interviewing (AVI) platforms, where candidates talk to a screen with questions, and virtual agent based interviewing platforms, where a human-like avatar interviews candidates. These anytime-anywhere interviewing systems scale up the overall reach of the interviewing process for firms, though they may not provide the best experience for the candidates. An important research question is how the candidates perceive such platforms and its impact on their performance and behavior. Also, is there an advantage of one setting vs. another i.e., Avatar vs. Platform? Finally, would such differences be consistent across cultures? In this paper, we present the results of a comparative study conducted in three different interview settings (i.e., Face-to-face, Avatar, and Platform), as well as two different cultural contexts (i.e., India and Switzerland), and analyze the differences in self-rated, others-rated performance, and automatic audiovisual behavioral cues.
Citations: 5
BreathEasy: Assessing Respiratory Diseases Using Mobile Multimodal Sensors
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418852
Md. Mahbubur Rahman, M. Y. Ahmed, Tousif Ahmed, Bashima Islam, Viswam Nathan, K. Vatanparvar, Ebrahim Nemati, Daniel McCaffrey, Jilong Kuang, J. Gao
Mobile respiratory assessments using commodity smartphones and smartwatches are an unmet need for patient monitoring at home. In this paper, we show the feasibility of using multimodal sensors embedded in consumer mobile devices for non-invasive, low-effort respiratory assessment. We have conducted studies with 228 chronic respiratory patients and healthy subjects, and show that our model can estimate respiratory rate with mean absolute error (MAE) 0.72 ± 0.62 breaths per minute and differentiate respiratory patients from healthy subjects with 90% recall and 76% precision when the user breathes normally by holding the device on the chest or the abdomen for a minute. Holding the device on the chest or abdomen demands significantly lower effort than traditional spirometry, which requires a specialized device and forceful, vigorous breathing. This paper shows the feasibility of developing a low-effort respiratory assessment towards making it available anywhere, anytime through users' own mobile devices.
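A rough sketch of one common approach to this kind of estimate, assuming a chest-held phone's accelerometer: band-pass filter around plausible breathing frequencies and count peaks over a one-minute window. The signal below is simulated and the filter settings are illustrative; this is not the paper's model.

```python
# Sketch: estimate respiratory rate from a chest-held phone's accelerometer by
# band-pass filtering around typical breathing frequencies (0.1-0.7 Hz) and
# counting peaks. The signal is simulated.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

fs = 50.0                                   # accelerometer sampling rate (Hz)
t = np.arange(0, 60, 1 / fs)                # one minute of data, as in the study protocol
true_rate_hz = 15 / 60.0                    # simulate 15 breaths per minute
accel_z = 0.02 * np.sin(2 * np.pi * true_rate_hz * t) + 0.005 * np.random.randn(t.size)

# Band-pass around plausible breathing frequencies to suppress drift and noise.
b, a = butter(2, [0.1 / (fs / 2), 0.7 / (fs / 2)], btype="band")
filtered = filtfilt(b, a, accel_z)

# Each prominent peak corresponds to one breath cycle.
peaks, _ = find_peaks(filtered, distance=fs / 0.7)   # breathing at most ~0.7 Hz
print("estimated breaths per minute:", len(peaks) * 60 / (t[-1] - t[0]))
```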
Citations: 11
X-AWARE: ConteXt-AWARE Human-Environment Attention Fusion for Driver Gaze Prediction in the Wild
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3417967
Lukas Stappen, Georgios Rizos, Björn Schuller
Reliable systems for automatic estimation of the driver's gaze are crucial for reducing the number of traffic fatalities and for many emerging research areas aimed at developing intelligent vehicle-passenger systems. Gaze estimation is a challenging task, especially in environments with varying illumination and reflection properties. Furthermore, there is wide diversity with respect to the appearance of drivers' faces, both in terms of occlusions (e.g. vision aids) and cultural/ethnic backgrounds. For this reason, analysing the face along with contextual information - for example, the vehicle cabin environment - adds another, less subjective signal towards the design of robust systems for passenger gaze estimation. In this paper, we present an integrated approach to jointly model different features for this task. In particular, to improve the fusion of the visually captured environment with the driver's face, we have developed a contextual attention mechanism, X-AWARE, attached directly to the output convolutional layers of InceptionResNetV2 networks. In order to showcase the effectiveness of our approach, we use the Driver Gaze in the Wild dataset, recently released as part of the Eighth Emotion Recognition in the Wild Challenge (EmotiW) challenge. Our best model outperforms the baseline by an absolute of 15.03% in accuracy on the validation set, and improves the previously best reported result by an absolute of 8.72% on the test set.
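A hedged PyTorch sketch of the fusion pattern described above: a spatial attention map computed from the context (cabin) feature maps gates the face feature maps before pooling and classification. The channel count, the single-map gating, and the nine gaze zones are illustrative assumptions, not the released X-AWARE code.

```python
# Sketch of context-aware fusion: an attention map computed from the scene (cabin)
# feature maps gates the face feature maps before pooling and classification.
# Channel sizes and the 9-zone output are illustrative, not the released model.
import torch
import torch.nn as nn

class ContextAwareFusion(nn.Module):
    def __init__(self, channels=1536, n_gaze_zones=9):
        super().__init__()
        self.attn = nn.Sequential(              # spatial attention from the context stream
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(channels, n_gaze_zones)

    def forward(self, face_maps, context_maps):
        # face_maps / context_maps: (B, C, H, W) outputs of two backbone CNNs,
        # e.g. the final convolutional maps of InceptionResNetV2-style encoders.
        gate = self.attn(context_maps)          # (B, 1, H, W) values in [0, 1]
        fused = face_maps * gate                # context decides where the face stream matters
        pooled = self.pool(fused).flatten(1)    # (B, C)
        return self.head(pooled)                # gaze-zone logits

model = ContextAwareFusion()
logits = model(torch.randn(2, 1536, 8, 8), torch.randn(2, 1536, 8, 8))
print(logits.shape)  # torch.Size([2, 9])
```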
Citations: 13