
Proceedings of the 2020 International Conference on Multimodal Interaction: Latest Publications

Extract the Gaze Multi-dimensional Information Analysis Driver Behavior
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3417972
Kui Lyu, Minghao Wang, Liyu Meng
Recent studies have shown that most traffic accidents are related to the driver's engagement in the driving process. Driver gaze is considered an important cue for monitoring driver distraction. While there has been marked improvement in driver gaze region estimation systems, many challenges remain, such as cross-subject testing, perspectives, and sensor configuration. In this paper, we propose a Convolutional Neural Network (CNN) based multi-model fusion gaze zone estimation system. Our method consists of two blocks: extraction of gaze features from RGB images and estimation of gaze from head pose features. Starting from the original input image, a general face processing model is first used to detect the face and localize 3D landmarks, from which the most relevant facial information is then extracted. We implement three face alignment methods to normalize the face information. For these image-based features, a multi-input CNN classifier achieves reliable classification accuracy. In addition, we design a 2D CNN based PointNet that predicts the head pose representation from the 3D landmarks. Finally, we evaluate our best-performing model on the Eighth EmotiW Driver Gaze Prediction sub-challenge test dataset. Our model achieves a competitive overall gaze zone estimation accuracy of 81.5144% on the cross-subject test dataset.
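To make the two-branch design concrete, here is a minimal PyTorch sketch of a multi-input classifier that fuses an RGB face branch with a PointNet-style branch over 3D landmarks. The layer sizes, the 68-landmark count, and the nine gaze zones are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch, assuming a small CNN face branch and a point-wise MLP landmark branch.
import torch
import torch.nn as nn

class GazeZoneNet(nn.Module):
    def __init__(self, num_zones=9, num_landmarks=68):
        super().__init__()
        # RGB face branch: a small CNN stands in for the paper's image feature extractor.
        self.face_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Landmark branch: a shared per-point MLP over (x, y, z), loosely following
        # the PointNet idea of per-point features followed by a pooling step.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 + 128, 128), nn.ReLU(),
            nn.Linear(128, num_zones),
        )

    def forward(self, face_img, landmarks_3d):
        # face_img: (B, 3, H, W); landmarks_3d: (B, num_landmarks, 3)
        img_feat = self.face_cnn(face_img)                            # (B, 64)
        point_feat = self.point_mlp(landmarks_3d).max(dim=1).values   # (B, 128)
        return self.classifier(torch.cat([img_feat, point_feat], dim=1))

logits = GazeZoneNet()(torch.randn(2, 3, 112, 112), torch.randn(2, 68, 3))
```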
Citations: 7
You Have a Point There: Object Selection Inside an Automobile Using Gaze, Head Pose and Finger Pointing
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418836
Abdul Rafey Aftab, M. V. D. Beeck, M. Feld
Sophisticated user interaction in the automotive industry is a fast-emerging topic. Mid-air gestures and speech already have numerous applications in driver-car interaction. Additionally, multimodal approaches are being developed to leverage multiple sensors for added advantages. In this paper, we propose a fast and practical multimodal fusion method based on machine learning for the selection of various control modules in an automotive vehicle. The modalities taken into account are gaze, head pose, and finger-pointing gesture. Speech is used only as a trigger for fusion. A single modality has previously been used numerous times for recognizing the user's pointing direction. We, however, demonstrate how multiple inputs can be fused together to enhance recognition performance. Furthermore, we compare different deep neural network architectures against conventional machine learning methods, namely Support Vector Regression and Random Forests, and show the improvement in pointing-direction accuracy obtained using deep learning. The results suggest great potential for multimodal inputs, which can be applied to more use cases in the vehicle.
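As an illustration of the late-fusion idea, the sketch below combines gaze, head-pose, and finger-pointing features in a small PyTorch regressor. The feature dimensions, the 2D pointing target, and the network shape are assumptions rather than the authors' architecture; in practice such a deep model would be compared against Support Vector Regression and Random Forest baselines, as the paper describes.

```python
# Minimal late-fusion sketch for pointing-target regression (illustrative only).
import torch
import torch.nn as nn

class PointingFusionNet(nn.Module):
    def __init__(self, gaze_dim=3, head_dim=6, finger_dim=6, target_dim=2):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "gaze": nn.Sequential(nn.Linear(gaze_dim, 32), nn.ReLU()),
            "head": nn.Sequential(nn.Linear(head_dim, 32), nn.ReLU()),
            "finger": nn.Sequential(nn.Linear(finger_dim, 32), nn.ReLU()),
        })
        self.regressor = nn.Sequential(
            nn.Linear(32 * 3, 64), nn.ReLU(),
            nn.Linear(64, target_dim),   # predicted pointing coordinates
        )

    def forward(self, gaze, head, finger):
        feats = [self.encoders["gaze"](gaze),
                 self.encoders["head"](head),
                 self.encoders["finger"](finger)]
        return self.regressor(torch.cat(feats, dim=-1))

# A speech trigger (e.g. a detected keyword) would gate when this prediction is used.
pred = PointingFusionNet()(torch.randn(4, 3), torch.randn(4, 6), torch.randn(4, 6))
```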
Citations: 15
Towards Multimodal Human-Like Characteristics and Expressive Visual Prosody in Virtual Agents
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3421155
Mireille Fares
One of the key challenges in designing Embodied Conversational Agents (ECAs) is to produce human-like gestural and visual prosody expressivity. Another major challenge is to maintain the interlocutor's attention by adapting the agent's behavior to the interlocutor's multimodal behavior. This paper outlines my PhD research plan, which aims to develop convincing, expressive, and natural behavior in ECAs and to explore and model the mechanisms that govern human-agent multimodal interaction. Additionally, I describe my first PhD milestone, which focuses on developing an end-to-end LSTM neural network model for upper-face gesture generation. The main task consists of building a model that can produce expressive and coherent upper-face gestures while considering multiple modalities: speech audio, text, and action units.
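A minimal sketch of the kind of sequence model described, assuming time-aligned per-frame audio and text features mapped to a small set of upper-face action-unit intensities. The feature dimensions and the choice of action units are illustrative assumptions only.

```python
# Illustrative LSTM mapping speech/text features to upper-face AU trajectories.
import torch
import torch.nn as nn

class UpperFaceLSTM(nn.Module):
    def __init__(self, audio_dim=40, text_dim=300, num_aus=5, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim + text_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_aus)  # e.g. AU1, AU2, AU4, AU5, AU7 intensities

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T, audio_dim); text_feats: (B, T, text_dim), time-aligned
        x = torch.cat([audio_feats, text_feats], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out)  # (B, T, num_aus) predicted action-unit trajectories

aus = UpperFaceLSTM()(torch.randn(2, 100, 40), torch.randn(2, 100, 300))
```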
Citations: 12
Automating Facilitation and Documentation of Collaborative Ideation Processes
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3421158
Matthias Merk
My research is in the field of computer-supported and computer-enabled innovation processes, in particular focusing on the first phases of ideation in a co-located environment. I am developing a concept for documenting, tracking, and enhancing creative ideation processes. The basis of this concept is a set of key figures derived from the various systems used within the ideation sessions. The system designed in my doctoral thesis enables interdisciplinary teams to kick-start creativity by automating facilitation, moderation, creativity support, and documentation of the process. Using the example of brainstorming, a standing table is equipped with camera- and microphone-based sensing as well as multiple means of interaction and visualization through projection and LED lights. The user interaction with the table is implicit and based on real-time metadata generated by the users of the system. System actions are calculated from what is happening on the table using object recognition. Everything on the table influences the system, making it a multimodal input and output device with implicit interaction. While the technical aspects of my research are close to completion, the more problematic part, evaluation, will benefit from feedback from the specialists in multimodal interaction at ICMI 2020.
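As an illustration only, the sketch below shows one way object-recognition output from the tabletop camera could be reduced to key figures and mapped to facilitation actions; the labels, thresholds, and action names are hypothetical and do not describe the author's implementation.

```python
# Hypothetical implicit-interaction loop: detections -> key figures -> system action.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str        # e.g. "sticky_note", "pen", "hand"
    confidence: float

def key_figures(detections):
    """Derive simple session key figures from objects detected on the table."""
    notes = sum(1 for d in detections if d.label == "sticky_note" and d.confidence > 0.5)
    hands = sum(1 for d in detections if d.label == "hand" and d.confidence > 0.5)
    return {"idea_count": notes, "activity_level": hands}

def choose_action(figures):
    """Map key figures to a facilitation action (names are hypothetical)."""
    if figures["activity_level"] == 0:
        return "project_creativity_prompt"   # nudge an idle group
    if figures["idea_count"] > 20:
        return "suggest_clustering_phase"    # move from divergence to convergence
    return "no_action"

action = choose_action(key_figures([Detection("sticky_note", 0.9), Detection("hand", 0.8)]))
```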
Citations: 0
Bridging Social Sciences and AI for Understanding Child Behaviour
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3419745
Heysem Kaya, R. Hessels, M. Najafian, S. Hanekamp, Saeid Safavi
Child behaviour is a topic of wide scientific interest among many different disciplines, including social and behavioural sciences and artificial intelligence (AI). In this workshop, we aimed to connect researchers from these fields to address topics such as the use of AI to better understand and model child behavioural and developmental processes, the challenges and opportunities for AI in large-scale child behaviour analysis, and the implementation of explainable ML/AI on sensitive child data. The workshop served as a successful first step towards this goal and attracted contributions from different research disciplines on the analysis of child behaviour. This paper provides a summary of the activities of the workshop and of the accepted papers and abstracts.
Citations: 2
Multi-rate Attention Based GRU Model for Engagement Prediction
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3417965
Bin Zhu, Xinjie Lan, Xin Guo, K. Barner, C. Boncelet
Engagement detection is essential in many areas such as driver attention tracking, employee engagement monitoring, and student engagement evaluation. In this paper, we propose a novel approach using attention-based hybrid deep models for the engagement prediction in the wild category of the 8th Emotion Recognition in the Wild (EmotiW 2020) Grand Challenge. The task is to predict the engagement intensity of subjects in videos; the subjects are students watching educational videos from Massive Open Online Courses (MOOCs). To complete the task, we propose a hybrid deep model based on multi-rate and multi-instance attention. The novelty of the proposed model can be summarized in three aspects: (a) an attention-based Gated Recurrent Unit (GRU) deep network, (b) heuristic multi-rate processing of video-based data, and (c) a rigorous and accurate ensemble model. Experimental results on the validation and test sets show that our method makes promising improvements, achieving a competitively low MSE of 0.0541 on the test set and improving on the baseline results by 64%. The proposed model won first place in the engagement prediction in the wild challenge.
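The following is a minimal sketch of an attention-pooled GRU regressor trained with an MSE objective, illustrating the core idea; the multi-rate, multi-instance processing and the ensemble described in the paper are omitted, and the frame-feature dimension is an assumption.

```python
# Minimal attention-pooled GRU regressor for engagement intensity (illustrative).
import torch
import torch.nn as nn

class AttentiveGRURegressor(nn.Module):
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)        # scores each time step
        self.out = nn.Linear(hidden, 1)         # engagement intensity in [0, 1]

    def forward(self, x):
        h, _ = self.gru(x)                      # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)  # (B, T, 1) attention weights
        pooled = (w * h).sum(dim=1)             # attention-weighted summary
        return torch.sigmoid(self.out(pooled)).squeeze(-1)

pred = AttentiveGRURegressor()(torch.randn(8, 300, 128))
loss = nn.MSELoss()(pred, torch.rand(8))        # MSE, as in the challenge metric
```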
Citations: 20
ROSMI: A Multimodal Corpus for Map-based Instruction-Giving
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418861
Miltiadis Marios Katsakioris, Ioannis Konstas, P. Mignotte, Helen F. Hastie
We present the publicly available Robot Open Street Map Instructions (ROSMI) corpus: a rich multimodal dataset of map and natural language instruction pairs that was collected via crowdsourcing. The goal of this corpus is to aid the advancement of state-of-the-art visual-dialogue tasks, including reference resolution and robot-instruction understanding. The domain described here concerns robots and autonomous systems used for inspection and emergency response. The ROSMI corpus is unique in that it captures interaction grounded in map-based visual stimuli that is human-readable but also contains the rich metadata needed to plan and deploy robots and autonomous systems, thus facilitating human-robot teaming.
Citations: 1
Touch Recognition with Attentive End-to-End Model
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418834
Wail El Bani, M. Chetouani
Touch is the earliest sense to develop and the first means of contact with the external world. Touch also plays a key role in our socio-emotional communication: we use it to communicate our feelings, elicit strong emotions in others, and modulate behavior (e.g., compliance). Despite its relevance, touch is an understudied modality in human-machine interaction compared to audition and vision. Most social touch recognition systems require a feature-engineering step, making them difficult to compare and to generalize to other databases. In this paper, we propose an end-to-end approach. We present an attention-based end-to-end model for touch gesture recognition, evaluated on two public datasets (CoST and HAART) in the context of the ICMI 2015 Social Touch Challenge. Our model gives a similar level of accuracy (61% on CoST and 68% on HAART) and uses self-attention as an alternative to feature engineering and recurrent neural networks.
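As a hedged sketch of an attention-based end-to-end classifier over pressure-sensor sequences: the 8x8 sensor grid and 14 gesture classes are assumptions loosely based on the CoST setup, and the Transformer encoder here stands in for whatever self-attention variant the authors used.

```python
# Minimal self-attention classifier over sequences of flattened pressure frames.
import torch
import torch.nn as nn

class TouchAttentionNet(nn.Module):
    def __init__(self, grid=8 * 8, d_model=64, num_classes=14):
        super().__init__()
        self.embed = nn.Linear(grid, d_model)            # per-frame embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(d_model, num_classes)

    def forward(self, frames):
        # frames: (B, T, 64) flattened pressure maps over time
        h = self.encoder(self.embed(frames))             # self-attention over time
        return self.cls(h.mean(dim=1))                   # average-pool, then classify

logits = TouchAttentionNet()(torch.randn(4, 120, 64))
```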
Citations: 0
MORSE: MultimOdal sentiment analysis for Real-life SEttings
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418821
Yiqun Yao, Verónica Pérez-Rosas, M. Abouelenien, Mihai Burzo
Multimodal sentiment analysis aims to detect and classify sentiment expressed in multimodal data. Research to date has focused on datasets with a large number of training samples, manual transcriptions, and nearly-balanced sentiment labels. However, data collection in real settings often leads to small datasets with noisy transcriptions and imbalanced label distributions, which are therefore significantly more challenging than in controlled settings. In this work, we introduce MORSE, a domain-specific dataset for MultimOdal sentiment analysis in Real-life SEttings. The dataset consists of 2,787 video clips extracted from 49 interviews with panelists in a product usage study, with each clip annotated for positive, negative, or neutral sentiment. The characteristics of MORSE include noisy transcriptions from raw videos, naturally imbalanced label distribution, and scarcity of minority labels. To address the challenging real-life settings in MORSE, we propose a novel two-step fine-tuning method for multimodal sentiment classification using transfer learning and the Transformer model architecture; our method starts with a pre-trained language model and one step of fine-tuning on the language modality, followed by the second step of joint fine-tuning that incorporates the visual and audio modalities. Experimental results show that while MORSE is challenging for various baseline models such as SVM and Transformer, our two-step fine-tuning method is able to capture the dataset characteristics and effectively address the challenges. Our method outperforms related work that uses both single and multiple modalities in the same transfer learning settings.
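A minimal sketch of the two-step fine-tuning schedule, assuming a BERT-style text encoder from the Hugging Face transformers library and fixed-size audio and visual feature vectors; the encoder choice, feature dimensions, and fusion head are illustrative assumptions, not the paper's exact setup.

```python
# Step 1: fine-tune the text encoder alone for sentiment (language-only head).
# Step 2: jointly fine-tune with audio and visual features through the fusion head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultimodalSentiment(nn.Module):
    def __init__(self, audio_dim=74, visual_dim=35, num_classes=3):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.text_encoder.config.hidden_size
        self.text_head = nn.Linear(hidden, num_classes)            # used in step 1
        self.fusion_head = nn.Sequential(                          # used in step 2
            nn.Linear(hidden + audio_dim + visual_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, input_ids, attention_mask, audio=None, visual=None):
        text = self.text_encoder(input_ids, attention_mask=attention_mask)
        cls = text.last_hidden_state[:, 0]                         # [CLS] summary
        if audio is None:                                          # step 1: language only
            return self.text_head(cls)
        return self.fusion_head(torch.cat([cls, audio, visual], dim=-1))

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["the product felt solid"], return_tensors="pt", padding=True)
step1_logits = MultimodalSentiment()(batch["input_ids"], batch["attention_mask"])
```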
Citations: 6
Finally on Par?! Multimodal and Unimodal Interaction for Open Creative Design Tasks in Virtual Reality
Pub Date : 2020-10-21 DOI: 10.1145/3382507.3418850
C. Zimmerer, Erik Wolf, Sara Wolf, Martin Fischbach, Jean-Luc Lugrin, Marc Erich Latoschik
Multimodal Interfaces (MMIs) have been considered a promising interaction paradigm for Virtual Reality (VR) for some time. However, they are still far less common than unimodal interfaces (UMIs). This paper presents a summative user study comparing an MMI to a typical UMI for a design task in VR. We developed an application targeting creative 3D object manipulation, i.e., creating 3D objects and modifying typical object properties such as color or size. The associated open user task is based on the Torrance Tests of Creative Thinking. We compared a synergistic multimodal interface using speech-accompanied pointing/grabbing gestures with a more typical unimodal interface using a hierarchical radial menu to trigger actions on selected objects. Independent judges rated the creativity of the resulting products using the Consensual Assessment Technique. Additionally, we measured the creativity-promoting factors flow, usability, and presence. Our results show that the MMI performs on par with the UMI in all measurements despite its limited flexibility and reliability. These promising results demonstrate the technological maturity of MMIs and their potential to efficiently extend traditional interaction techniques in VR.
Citations: 7