
Proceedings of the 2015 ACM on International Conference on Multimodal Interaction: Latest Publications

2015 Multimodal Learning and Analytics Grand Challenge
M. Worsley, K. Chiluiza, Joseph F. Grafsgaard, X. Ochoa
Multimodality is an integral part of teaching and learning. Over the past few decades researchers have been designing, creating and analyzing novel environments that enable students to experience and demonstrate learning through a variety of modalities. The recent availability of low cost multimodal sensors, advances in artificial intelligence and improved techniques for large scale data analysis have enabled researchers and practitioners to push the boundaries on multimodal learning and multimodal learning analytics. In an effort to continue these developments, the 2015 Multimodal Learning and Analytics Grand Challenge includes a combined focus on new techniques to capture multimodal learning data, as well as the development of rich, multimodal learning applications.
DOI: 10.1145/2818346.2829995 · Published: 2015-11-09
Citations: 9
Gestimator: Shape and Stroke Similarity Based Gesture Recognition
Yina Ye, P. Nurmi
Template-based approaches are currently the most popular gesture recognition solution for interactive systems as they provide accurate and runtime efficient performance in a wide range of applications. The basic idea in these approaches is to measure similarity between a user gesture and a set of pre-recorded templates, and to determine the appropriate gesture type using a nearest neighbor classifier. While simple and elegant, this approach performs well only when the gestures are relatively simple and unambiguous. In increasingly many scenarios, such as authentication, interactive learning, and health care applications, the gestures of interest are complex, consist of multiple sub-strokes, and closely resemble other gestures. Merely considering the shape of the gesture is not sufficient for these scenarios, and robust identification of the constituent sequence of sub-strokes is also required. The present paper contributes by introducing Gestimator, a novel gesture recognizer that combines shape and stroke-based similarity into a sequential classification framework for robust gesture recognition. Experiments carried out using three datasets demonstrate significant performance gains compared to current state-of-the-art techniques. The performance improvements are highest for complex gestures, but consistent improvements are achieved even for simple and widely studied gesture types.
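The template-matching pipeline described above (resample a stroke, normalize it, pick the nearest labelled template) can be sketched in a few lines of Python. This is a generic shape-only nearest-neighbour matcher, not the Gestimator algorithm itself; the point count, normalization, and Euclidean distance are illustrative assumptions.

```python
import numpy as np

def resample(points, n=64):
    """Resample a 2-D stroke to n equally spaced points along its arc length."""
    pts = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    dist = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, dist[-1], n)
    return np.column_stack([np.interp(targets, dist, pts[:, i]) for i in range(2)])

def normalize(pts):
    """Translate to the centroid and scale to a unit bounding box."""
    pts = pts - pts.mean(axis=0)
    scale = np.ptp(pts, axis=0).max()
    return pts / scale if scale > 0 else pts

def classify(gesture, templates):
    """Nearest-neighbour match of a user stroke against labelled templates."""
    g = normalize(resample(gesture))
    best_label, best_dist = None, np.inf
    for label, tmpl in templates:
        d = np.linalg.norm(g - normalize(resample(tmpl)))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist
```

Gestimator additionally reasons about the sequence of constituent sub-strokes, which this shape-only sketch deliberately omits.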
DOI: 10.1145/2818346.2820734 · Published: 2015-11-09
Citations: 17
Evaluating Speech, Face, Emotion and Body Movement Time-series Features for Automated Multimodal Presentation Scoring
Vikram Ramanarayanan, C. W. Leong, L. Chen, G. Feng, David Suendermann-Oeft
We analyze how fusing features obtained from different multimodal data streams such as speech, face, body movement and emotion tracks can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series-based features from these data streams: the former are statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of co-occurrences, capture how different prototypical body postures or facial configurations co-occur within different time lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech-stream features, in predicting human-rated scores of multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
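The histogram-of-cooccurrences idea, counting how often pairs of prototypical states co-occur at several time lags, can be illustrated with a short sketch. The code below assumes the multimodal stream has already been quantized into per-frame cluster IDs; the lag values and normalization are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

def cooccurrence_histogram(labels, n_states, lags=(1, 5, 10)):
    """
    Histogram-of-cooccurrence features for a discrete time series.

    labels   : per-frame cluster IDs (e.g. prototypical posture indices)
    n_states : number of distinct cluster IDs
    lags     : time lags (in frames) at which co-occurrence is counted

    Returns a flat vector with one n_states x n_states histogram per lag,
    normalized so that each lag's histogram sums to 1.
    """
    labels = np.asarray(labels)
    feats = []
    for lag in lags:
        hist = np.zeros((n_states, n_states))
        for a, b in zip(labels[:-lag], labels[lag:]):
            hist[a, b] += 1
        total = hist.sum()
        feats.append(hist / total if total > 0 else hist)
    return np.concatenate([h.ravel() for h in feats])

# Example: 3 posture prototypes observed over a short clip.
frames = [0, 0, 1, 2, 2, 2, 1, 0, 1, 2, 2, 0]
features = cooccurrence_histogram(frames, n_states=3)
print(features.shape)  # (3 lags * 3 * 3,) = (27,)
```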
DOI: 10.1145/2818346.2820765 · Published: 2015-11-09
Citations: 45
Look & Pedal: Hands-free Navigation in Zoomable Information Spaces through Gaze-supported Foot Input
Konstantin Klamka, A. Siegel, Stefan Vogt, F. Göbel, S. Stellmach, Raimund Dachselt
For a desktop computer, we investigate how to enhance conventional mouse and keyboard interaction by combining the input modalities gaze and foot. This multimodal approach offers the potential for fluently performing both manual input (e.g., for precise object selection) and gaze-supported foot input (for pan and zoom) in zoomable information spaces in quick succession or even in parallel. For this, we take advantage of fast gaze input to implicitly indicate where to navigate to and additional explicit foot input for speed control while leaving the hands free for further manual input. This allows for taking advantage of gaze input in a subtle and unobtrusive way. We have carefully elaborated and investigated three variants of foot controls incorporating one-, two- and multidirectional foot pedals in combination with gaze. These were evaluated and compared to mouse-only input in a user study using Google Earth as a geographic information system. The results suggest that gaze-supported foot input is feasible for convenient, user-friendly navigation and comparable to mouse input and encourage further investigations of gaze-supported foot controls.
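As a rough illustration of gaze-supported zooming, where gaze picks the target and the pedal sets the speed, the sketch below updates a 2-D viewport each frame so that the gazed-at point stays fixed on screen while pedal deflection drives an exponential zoom. The Viewport fields, speed constant, and pedal range are hypothetical, not the study's implementation.

```python
from dataclasses import dataclass

@dataclass
class Viewport:
    cx: float    # world x at the screen centre
    cy: float    # world y at the screen centre
    zoom: float  # scale factor (screen pixels per world unit)

def update_viewport(vp, gaze_xy, pedal, dt, speed=1.5):
    """
    One frame of gaze-supported zooming (a sketch, not the paper's system).

    gaze_xy : gaze point in world coordinates -- *where* to navigate
    pedal   : signed pedal deflection in [-1, 1] -- zoom speed and direction
    dt      : frame time in seconds
    """
    factor = (1.0 + speed * dt) ** pedal  # exponential zoom rate
    # Keep the gazed-at point stationary on screen while zooming towards it.
    gx, gy = gaze_xy
    cx = gx + (vp.cx - gx) / factor
    cy = gy + (vp.cy - gy) / factor
    return Viewport(cx, cy, vp.zoom * factor)

vp = Viewport(cx=0.0, cy=0.0, zoom=1.0)
vp = update_viewport(vp, gaze_xy=(10.0, 5.0), pedal=0.8, dt=0.016)
```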
DOI: 10.1145/2818346.2820751 · Published: 2015-11-09
Citations: 41
Sharing Representations for Long Tail Computer Vision Problems
Samy Bengio
The long tail phenomenon appears when a small number of objects/words/classes are very frequent and thus easy to model, while many, many more are rare and thus hard to model. This has always been a problem in machine learning. We start by explaining why representation sharing in general, and embedding approaches in particular, can help to represent tail objects. Several embedding approaches are presented, in increasing levels of complexity, to show how to tackle the long tail problem, from rare classes to unseen classes in image classification (the so-called zero-shot setting). Finally, we present our latest results on image captioning, which can be seen as an ultimate rare-class problem, since each image is attributed to a novel, yet structured, class in the form of a meaningful descriptive sentence.
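A minimal sketch of the embedding idea in the zero-shot setting: images and class labels are mapped into a shared space, and an image is assigned to the nearest class embedding, which may include classes never seen during image training. The cosine scoring and random toy embeddings below are illustrative only, not the models discussed in the talk.

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs):
    """
    Nearest-class-embedding prediction in a shared space.

    image_emb  : (d,) embedding of an image, produced by a vision model
    class_embs : dict mapping class name -> (d,) label embedding
                 (e.g. word vectors), which may include classes never seen
                 in the image training data
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(class_embs, key=lambda c: cos(image_emb, class_embs[c]))

# Toy example with random placeholder embeddings.
rng = np.random.default_rng(0)
classes = {name: rng.normal(size=128) for name in ["okapi", "zebra", "tapir"]}
image = classes["okapi"] + 0.1 * rng.normal(size=128)  # stand-in for a vision-model output
print(zero_shot_classify(image, classes))  # -> "okapi"
```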
DOI: 10.1145/2818346.2818348 · Published: 2015-11-09
Citations: 28
Touch Challenge '15: Recognizing Social Touch Gestures
Merel M. Jung, Laura Cang, M. Poel, Karon E Maclean
Advances in the field of touch recognition could open up applications for touch-based interaction in areas such as Human-Robot Interaction (HRI). We extended this challenge to the research community working on multimodal interaction with the goal of sparking interest in the touch modality and promoting exploration of data processing techniques from other, more mature modalities for touch recognition. Two data sets were made available containing labeled pressure-sensor data of social touch gestures performed by touching a touch-sensitive surface with the hand. Each set was collected from similar sensor grids, but under conditions reflecting different application orientations: CoST (Corpus of Social Touch) and HAART (the Human-Animal Affective Robot Touch gesture set). In this paper we describe the challenge protocol and summarize the results from the touch challenge hosted in conjunction with the 2015 ACM International Conference on Multimodal Interaction (ICMI). The most important outcomes of the challenge were: (1) transferring techniques from other modalities, such as image processing, speech, and human action recognition, provided valuable feature sets; (2) gesture classification confusions were similar despite the various data processing methods used.
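A minimal sketch of one baseline route through such data: summarize each pressure-grid gesture with a handful of hand-crafted statistics and feed them to an off-the-shelf classifier. The feature list, pressure threshold, and scikit-learn random forest are illustrative assumptions, not a reconstruction of any particular challenge entry.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def touch_features(gesture, threshold=0.1):
    """
    Simple summary features for one social-touch gesture.

    gesture : array of shape (frames, rows, cols) with pressure values
    """
    g = np.asarray(gesture, dtype=float)
    per_frame_mean = g.mean(axis=(1, 2))
    contact_area = (g > threshold).mean(axis=(1, 2))  # fraction of taxels active
    return np.array([
        g.shape[0],             # duration in frames
        per_frame_mean.mean(),  # overall mean pressure
        g.max(),                # peak pressure
        contact_area.mean(),    # average contact area
        per_frame_mean.std(),   # temporal pressure variation
    ])

# Hypothetical usage, assuming X is a list of gesture arrays and y their labels:
# X_feat = np.stack([touch_features(g) for g in X])
# clf = RandomForestClassifier(n_estimators=200).fit(X_feat, y)
```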
DOI: 10.1145/2818346.2829993 · Published: 2015-11-09
Citations: 35
Public Speaking Training with a Multimodal Interactive Virtual Audience Framework
Mathieu Chollet, Kalin Stefanov, H. Prendinger, Stefan Scherer
We have developed an interactive virtual audience platform for public speaking training. Users' public speaking behavior is automatically analyzed using multimodal sensors, and multimodal feedback is produced by virtual characters and generic visual widgets depending on the user's behavior. The flexibility of our system allows us to compare different interaction media (e.g. virtual reality vs. normal interaction), social situations (e.g. one-on-one meetings vs. large audiences) and trained behaviors (e.g. general public speaking performance vs. specific behaviors).
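A toy sketch of the kind of behavior-to-feedback mapping such a platform needs: measured speaking metrics are turned into audience reactions or widget cues. The metric names, thresholds, and cue strings below are hypothetical and not the framework's actual rules.

```python
def audience_feedback(metrics):
    """
    Map measured speaking behavior to simple feedback cues.

    metrics: dict with e.g. 'gaze_at_audience' (0-1), 'speech_rate_wpm',
             'filled_pause_rate' (filled pauses per minute) -- hypothetical names.
    """
    cues = []
    if metrics.get("gaze_at_audience", 1.0) < 0.5:
        cues.append("audience members look away")      # discourage reading from notes
    if not 110 <= metrics.get("speech_rate_wpm", 140) <= 170:
        cues.append("show pacing widget warning")       # speaking too slowly or too fast
    if metrics.get("filled_pause_rate", 0.0) > 8:
        cues.append("audience members lean back")       # too many "um"/"uh"
    return cues or ["audience nods attentively"]

print(audience_feedback({"gaze_at_audience": 0.3, "speech_rate_wpm": 190}))
```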
DOI: 10.1145/2818346.2823294 · Published: 2015-11-09
Citations: 18
Towards Attentive, Bi-directional MOOC Learning on Mobile Devices
Xiang Xiao, Jingtao Wang
AttentiveLearner is a mobile learning system optimized for consuming lecture videos in Massive Open Online Courses (MOOCs) and flipped classrooms. AttentiveLearner converts the built-in camera of mobile devices into both a tangible video control channel and an implicit heart rate sensing channel by analyzing the learner's fingertip transparency changes in real time. In this paper, we report disciplined research efforts in making AttentiveLearner truly practical in real-world use. Through two 18-participant user studies and follow-up analyses, we found that 1) the tangible video control interface is intuitive to use and efficient to operate; 2) heart rate signals implicitly captured by AttentiveLearner can be used to infer both the learner's interests and perceived confusion levels towards the corresponding learning topics; 3) AttentiveLearner can achieve significantly higher accuracy by predicting extreme personal learning events and aggregated learning events.
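The implicit heart-rate channel rests on camera-based photoplethysmography: with a fingertip covering the lens, frame brightness pulses with blood volume. The sketch below recovers a pulse rate from a per-frame brightness signal via a band-limited FFT peak; the frame rate, frequency band, and synthetic test signal are illustrative and not AttentiveLearner's actual pipeline.

```python
import numpy as np

def estimate_heart_rate(brightness, fps):
    """
    Estimate heart rate from fingertip-camera brightness.

    brightness : per-frame mean intensity of the camera image while the
                 fingertip covers the lens
    fps        : camera frame rate
    """
    x = np.asarray(brightness, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    spectrum = np.abs(np.fft.rfft(x))
    # Restrict to a plausible heart-rate band: 0.7-3.0 Hz (42-180 bpm).
    band = (freqs >= 0.7) & (freqs <= 3.0)
    peak_freq = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_freq  # beats per minute

# Synthetic 1.2 Hz (72 bpm) pulse sampled at 30 fps for 10 seconds.
t = np.arange(0, 10, 1 / 30)
signal = 0.5 * np.sin(2 * np.pi * 1.2 * t) + 0.05 * np.random.randn(t.size)
print(round(estimate_heart_rate(signal, fps=30)))  # ~72
```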
DOI: 10.1145/2818346.2820754 · Published: 2015-11-09
Citations: 30
Behavioral and Emotional Spoken Cues Related to Mental States in Human-Robot Social Interaction
Lucile Bechade, G. D. Duplessis, M. A. Sehili, L. Devillers
Understanding human behavioral and emotional cues occurring in interaction has become a major research interest due to the emergence of numerous applications such as social robotics. While there is agreement across different theories that some behavioral signals are involved in communicating information, there is a lack of consensus regarding their specificity, their universality, and whether they convey emotional, affective, cognitive, or mental states, or all of those. Our goal in this study is to explore the relationship between behavioral and emotional cues extracted from speech (e.g., laughter, speech duration, negative emotions) and different communicative information about the human participant. This study is based on a corpus of audio/video data of humorous interactions between the Nao robot and 37 human participants. Participants filled out three questionnaires about their personality, sense of humor, and mental states regarding the interaction. This work reveals the existence of many links between behavioral and emotional cues and the mental states reported by human participants through self-report questionnaires. However, we have not found a clear connection between reported mental states and participant profiles.
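Relating extracted spoken cues to self-reported states typically comes down to correlating per-participant cue statistics with questionnaire scores. The sketch below does this with Spearman correlations on a few hypothetical rows; the cue columns and score values are invented for illustration and are not the study's data.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-participant measurements: one row per participant.
# Columns of `cues`: laughter count, total speech duration (s), negative-emotion ratio.
cues = np.array([[4, 95.0, 0.10],
                 [1, 60.5, 0.30],
                 [7, 120.0, 0.05],
                 [2, 70.0, 0.22]])
amusement_score = np.array([4.5, 2.0, 5.0, 2.5])  # illustrative self-reported mental state

cue_names = ["laughter", "speech_duration", "negative_emotion"]
for i, name in enumerate(cue_names):
    rho, p = spearmanr(cues[:, i], amusement_score)
    print(f"{name:>17s}: rho={rho:+.2f}  p={p:.3f}")
```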
DOI: 10.1145/2818346.2820777 · Published: 2015-11-09
Citations: 8
Multimodal Capture of Teacher-Student Interactions for Automated Dialogic Analysis in Live Classrooms
S. D’Mello, A. Olney, Nathaniel Blanchard, Borhan Samei, Xiaoyi Sun, Brooke Ward, Sean Kelly
We focus on data collection designs for the automated analysis of teacher-student interactions in live classrooms with the goal of identifying instructional activities (e.g., lecturing, discussion) and assessing the quality of dialogic instruction (e.g., analysis of questions). Our designs were motivated by multiple technical requirements and constraints. Most importantly, teachers could be individually mic'ed, but their audio needed to be of excellent quality for automatic speech recognition (ASR) and spoken utterance segmentation. Individual students could not be mic'ed, but classroom audio quality only needed to be sufficient to detect student spoken utterances. Visual information could only be recorded if students could not be identified. Design 1 used an omnidirectional laptop microphone to record both teacher and classroom audio and was quickly deemed unsuitable. In Designs 2 and 3, teachers wore a wireless Samson AirLine 77 vocal headset system, which is a unidirectional microphone with a cardioid pickup pattern. In Design 2, classroom audio was recorded with dual first-generation Microsoft Kinects placed at the front corners of the class. Design 3 used a Crown PZM-30D pressure zone microphone mounted on the blackboard to record classroom audio. Designs 2 and 3 were tested by recording audio in 38 live middle school classrooms from six U.S. schools while trained human coders simultaneously performed live coding of classroom discourse. Qualitative and quantitative analyses revealed that Design 3 was suitable for three of our core tasks: (1) ASR on teacher speech (word recognition rate of 66% and word overlap rate of 69% using the Google Speech ASR engine); (2) teacher utterance segmentation (F-measure of 97%); and (3) student utterance segmentation (F-measure of 66%). Ideas to incorporate video and skeletal tracking with dual second-generation Kinects to produce Design 4 are discussed.
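Boundary-based segmentation scores such as the F-measures quoted above are commonly computed by matching hypothesized boundaries to reference boundaries within a small time tolerance. The sketch below uses one such common formulation; the 0.25 s tolerance and greedy matching are assumptions, since the paper's exact scoring protocol is not given here.

```python
def boundary_f_measure(ref, hyp, tol=0.25):
    """
    Precision/recall/F for utterance boundary detection.

    ref, hyp : sorted lists of boundary times in seconds
    tol      : match tolerance in seconds
    """
    matched = set()
    hits = 0
    for h in hyp:
        for i, r in enumerate(ref):
            if i not in matched and abs(h - r) <= tol:
                matched.add(i)  # each reference boundary can be matched once
                hits += 1
                break
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f

print(boundary_f_measure(ref=[1.0, 3.2, 7.5], hyp=[1.1, 3.0, 5.0, 7.4]))
```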
DOI: 10.1145/2818346.2830602 · Published: 2015-11-09
Citations: 49