M. Worsley, K. Chiluiza, Joseph F. Grafsgaard, X. Ochoa
Multimodality is an integral part of teaching and learning. Over the past few decades, researchers have been designing, creating, and analyzing novel environments that enable students to experience and demonstrate learning through a variety of modalities. The recent availability of low-cost multimodal sensors, advances in artificial intelligence, and improved techniques for large-scale data analysis have enabled researchers and practitioners to push the boundaries of multimodal learning and multimodal learning analytics. In an effort to continue these developments, the 2015 Multimodal Learning and Analytics Grand Challenge includes a combined focus on new techniques to capture multimodal learning data and on the development of rich, multimodal learning applications.
{"title":"2015 Multimodal Learning and Analytics Grand Challenge","authors":"M. Worsley, K. Chiluiza, Joseph F. Grafsgaard, X. Ochoa","doi":"10.1145/2818346.2829995","DOIUrl":"https://doi.org/10.1145/2818346.2829995","url":null,"abstract":"Multimodality is an integral part of teaching and learning. Over the past few decades researchers have been designing, creating and analyzing novel environments that enable students to experience and demonstrate learning through a variety of modalities. The recent availability of low cost multimodal sensors, advances in artificial intelligence and improved techniques for large scale data analysis have enabled researchers and practitioners to push the boundaries on multimodal learning and multimodal learning analytics. In an effort to continue these developments, the 2015 Multimodal Learning and Analytics Grand Challenge includes a combined focus on new techniques to capture multimodal learning data, as well as the development of rich, multimodal learning applications.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87776383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Template-based approaches are currently the most popular gesture recognition solution for interactive systems, as they provide accurate and runtime-efficient performance in a wide range of applications. The basic idea in these approaches is to measure the similarity between a user gesture and a set of pre-recorded templates, and to determine the appropriate gesture type using a nearest-neighbor classifier. While simple and elegant, this approach performs well only when the gestures are relatively simple and unambiguous. In a growing number of scenarios, such as authentication, interactive learning, and health care applications, the gestures of interest are complex, consist of multiple sub-strokes, and closely resemble other gestures. Merely considering the shape of the gesture is not sufficient for these scenarios; robust identification of the constituent sequence of sub-strokes is also required. The present paper contributes Gestimator, a novel gesture recognizer that combines shape- and stroke-based similarity into a sequential classification framework for robust gesture recognition. Experiments carried out on three datasets demonstrate significant performance gains over current state-of-the-art techniques. The performance improvements are highest for complex gestures, but consistent improvements are achieved even for simple and widely studied gesture types.
{"title":"Gestimator: Shape and Stroke Similarity Based Gesture Recognition","authors":"Yina Ye, P. Nurmi","doi":"10.1145/2818346.2820734","DOIUrl":"https://doi.org/10.1145/2818346.2820734","url":null,"abstract":"Template-based approaches are currently the most popular gesture recognition solution for interactive systems as they provide accurate and runtime efficient performance in a wide range of applications. The basic idea in these approaches is to measure similarity between a user gesture and a set of pre-recorded templates, and to determine the appropriate gesture type using a nearest neighbor classifier. While simple and elegant, this approach performs well only when the gestures are relatively simple and unambiguous. In increasingly many scenarios, such as authentication, interactive learning, and health care applications, the gestures of interest are complex, consist of multiple sub-strokes, and closely resemble other gestures. Merely considering the shape of the gesture is not sufficient for these scenarios, and robust identification of the constituent sequence of sub-strokes is also required. The present paper contributes by introducing Gestimator, a novel gesture recognizer that combines shape and stroke-based similarity into a sequential classification framework for robust gesture recognition. Experiments carried out using three datasets demonstrate significant performance gains compared to current state-of-the-art techniques. The performance improvements are highest for complex gestures, but consistent improvements are achieved even for simple and widely studied gesture types.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86431193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vikram Ramanarayanan, C. W. Leong, L. Chen, G. Feng, David Suendermann-Oeft
We analyze how fusing features obtained from different multimodal data streams, such as speech, face, body movement, and emotion tracks, can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series-based features from these data streams: the former are statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of co-occurrences, capture how different prototypical body postures or facial configurations co-occur within different time lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech-stream features, in predicting human-rated scores of multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
{"title":"Evaluating Speech, Face, Emotion and Body Movement Time-series Features for Automated Multimodal Presentation Scoring","authors":"Vikram Ramanarayanan, C. W. Leong, L. Chen, G. Feng, David Suendermann-Oeft","doi":"10.1145/2818346.2820765","DOIUrl":"https://doi.org/10.1145/2818346.2820765","url":null,"abstract":"We analyze how fusing features obtained from different multimodal data streams such as speech, face, body movement and emotion tracks can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series based features from these data streams--the former being statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of cooccurrences, capture how different prototypical body posture or facial configurations co-occur within different time-lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech stream features in predicting human-rated scores of multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83886429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Konstantin Klamka, A. Siegel, Stefan Vogt, F. Göbel, S. Stellmach, Raimund Dachselt
For a desktop computer, we investigate how to enhance conventional mouse and keyboard interaction by combining the input modalities gaze and foot. This multimodal approach offers the potential for fluently performing both manual input (e.g., for precise object selection) and gaze-supported foot input (for pan and zoom) in zoomable information spaces in quick succession or even in parallel. We take advantage of fast gaze input to implicitly indicate where to navigate to, with additional explicit foot input for speed control, while leaving the hands free for further manual input. This allows gaze input to be used in a subtle and unobtrusive way. We carefully designed and investigated three variants of foot controls incorporating one-, two-, and multidirectional foot pedals in combination with gaze. These were evaluated and compared to mouse-only input in a user study using Google Earth as a geographic information system. The results suggest that gaze-supported foot input is feasible for convenient, user-friendly navigation and comparable to mouse input, and they encourage further investigation of gaze-supported foot controls.
{"title":"Look & Pedal: Hands-free Navigation in Zoomable Information Spaces through Gaze-supported Foot Input","authors":"Konstantin Klamka, A. Siegel, Stefan Vogt, F. Göbel, S. Stellmach, Raimund Dachselt","doi":"10.1145/2818346.2820751","DOIUrl":"https://doi.org/10.1145/2818346.2820751","url":null,"abstract":"For a desktop computer, we investigate how to enhance conventional mouse and keyboard interaction by combining the input modalities gaze and foot. This multimodal approach offers the potential for fluently performing both manual input (e.g., for precise object selection) and gaze-supported foot input (for pan and zoom) in zoomable information spaces in quick succession or even in parallel. For this, we take advantage of fast gaze input to implicitly indicate where to navigate to and additional explicit foot input for speed control while leaving the hands free for further manual input. This allows for taking advantage of gaze input in a subtle and unobtrusive way. We have carefully elaborated and investigated three variants of foot controls incorporating one-, two- and multidirectional foot pedals in combination with gaze. These were evaluated and compared to mouse-only input in a user study using Google Earth as a geographic information system. The results suggest that gaze-supported foot input is feasible for convenient, user-friendly navigation and comparable to mouse input and encourage further investigations of gaze-supported foot controls.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79590685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The long tail phenomenon appears when a small number of objects/words/classes are very frequent and thus easy to model, while many more are rare and thus hard to model. This has always been a problem in machine learning. We start by explaining why representation sharing in general, and embedding approaches in particular, can help to represent tail objects. Several embedding approaches are presented, in increasing levels of complexity, to show how to tackle the long tail problem, from rare classes to unseen classes in image classification (the so-called zero-shot setting). Finally, we present our latest results on image captioning, which can be seen as an ultimate rare-class problem, since each image is attributed to a novel, yet structured, class in the form of a meaningful descriptive sentence.
{"title":"Sharing Representations for Long Tail Computer Vision Problems","authors":"Samy Bengio","doi":"10.1145/2818346.2818348","DOIUrl":"https://doi.org/10.1145/2818346.2818348","url":null,"abstract":"The long tail phenomena appears when a small number of objects/words/classes are very frequent and thus easy to model, while many many more are rare and thus hard to model. This has always been a problem in machine learning. We start by explaining why representation sharing in general, and embedding approaches in particular, can help to represent tail objects. Several embedding approaches are presented, in increasing levels of complexity, to show how to tackle the long tail problem, from rare classes to unseen classes in image classification (the so-called zero-shot setting). Finally, we present our latest results on image captioning, which can be seen as an ultimate rare class problem since each image is attributed to a novel, yet structured, class in the form of a meaningful descriptive sentence.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"344 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79609075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Merel M. Jung, Laura Cang, M. Poel, Karon E Maclean
Advances in the field of touch recognition could open up applications for touch-based interaction in areas such as Human-Robot Interaction (HRI). We extended this challenge to the research community working on multimodal interaction with the goal of sparking interest in the touch modality and promoting the exploration of data processing techniques from other, more mature modalities for touch recognition. Two data sets were made available containing labeled pressure sensor data of social touch gestures performed by touching a touch-sensitive surface with the hand. Each set was collected from similar sensor grids, but under conditions reflecting different application orientations: CoST, the Corpus of Social Touch, and HAART, the Human-Animal Affective Robot Touch gesture set. In this paper we describe the challenge protocol and summarize the results of the touch challenge hosted in conjunction with the 2015 ACM International Conference on Multimodal Interaction (ICMI). The most important outcomes of the challenge were: (1) transferring techniques from other modalities, such as image processing, speech, and human action recognition, provided valuable feature sets; and (2) gesture classification confusions were similar despite the various data processing methods used.
{"title":"Touch Challenge '15: Recognizing Social Touch Gestures","authors":"Merel M. Jung, Laura Cang, M. Poel, Karon E Maclean","doi":"10.1145/2818346.2829993","DOIUrl":"https://doi.org/10.1145/2818346.2829993","url":null,"abstract":"Advances in the field of touch recognition could open up applications for touch-based interaction in areas such as Human-Robot Interaction (HRI). We extended this challenge to the research community working on multimodal interaction with the goal of sparking interest in the touch modality and to promote exploration of the use of data processing techniques from other more mature modalities for touch recognition. Two data sets were made available containing labeled pressure sensor data of social touch gestures that were performed by touching a touch-sensitive surface with the hand. Each set was collected from similar sensor grids, but under conditions reflecting different application orientations: CoST: Corpus of Social Touch and HAART: The Human-Animal Affective Robot Touch gesture set. In this paper we describe the challenge protocol and summarize the results from the touch challenge hosted in conjunction with the 2015 ACM International Conference on Multimodal Interaction (ICMI). The most important outcomes of the challenges were: (1) transferring techniques from other modalities, such as image processing, speech, and human action recognition provided valuable feature sets; (2) gesture classification confusions were similar despite the various data processing methods used.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"31 10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89972224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mathieu Chollet, Kalin Stefanov, H. Prendinger, Stefan Scherer
We have developed an interactive virtual audience platform for public speaking training. Users' public speaking behavior is automatically analyzed using multimodal sensors, and multimodal feedback is produced by virtual characters and generic visual widgets depending on the user's behavior. The flexibility of our system allows us to compare different interaction media (e.g., virtual reality vs. normal interaction), social situations (e.g., one-on-one meetings vs. large audiences), and trained behaviors (e.g., general public speaking performance vs. specific behaviors).
{"title":"Public Speaking Training with a Multimodal Interactive Virtual Audience Framework","authors":"Mathieu Chollet, Kalin Stefanov, H. Prendinger, Stefan Scherer","doi":"10.1145/2818346.2823294","DOIUrl":"https://doi.org/10.1145/2818346.2823294","url":null,"abstract":"We have developed an interactive virtual audience platform for public speaking training. Users' public speaking behavior is automatically analyzed using multimodal sensors, and ultimodal feedback is produced by virtual characters and generic visual widgets depending on the user's behavior. The flexibility of our system allows to compare different interaction mediums (e.g. virtual reality vs normal interaction), social situations (e.g. one-on-one meetings vs large audiences) and trained behaviors (e.g. general public speaking performance vs specific behaviors).","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91200726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AttentiveLearner is a mobile learning system optimized for consuming lecture videos in Massive Open Online Courses (MOOCs) and flipped classrooms. AttentiveLearner converts the built-in camera of mobile devices into both a tangible video control channel and an implicit heart rate sensing channel by analyzing the learner's fingertip transparency changes in real time. In this paper, we report disciplined research efforts to make AttentiveLearner truly practical in real-world use. Through two 18-participant user studies and follow-up analyses, we found that (1) the tangible video control interface is intuitive to use and efficient to operate; (2) heart rate signals implicitly captured by AttentiveLearner can be used to infer both the learner's interest and perceived confusion toward the corresponding learning topics; and (3) AttentiveLearner achieves significantly higher accuracy when predicting extreme personal learning events and aggregated learning events.
{"title":"Towards Attentive, Bi-directional MOOC Learning on Mobile Devices","authors":"Xiang Xiao, Jingtao Wang","doi":"10.1145/2818346.2820754","DOIUrl":"https://doi.org/10.1145/2818346.2820754","url":null,"abstract":"AttentiveLearner is a mobile learning system optimized for consuming lecture videos in Massive Open Online Courses (MOOCs) and flipped classrooms. AttentiveLearner converts the built-in camera of mobile devices into both a tangible video control channel and an implicit heart rate sensing channel by analyzing the learner's fingertip transparency changes in real time. In this paper, we report disciplined research efforts in making AttentiveLearner truly practical in real-world use. Through two 18-participant user studies and follow-up analyses, we found that 1) the tangible video control interface is intuitive to use and efficient to operate; 2) heart rate signals implicitly captured by AttentiveLearner can be used to infer both the learner's interests and perceived confusion levels towards the corresponding learning topics; 3) AttentiveLearner can achieve significantly higher accuracy by predicting extreme personal learning events and aggregated learning events.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89584453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lucile Bechade, G. D. Duplessis, M. A. Sehili, L. Devillers
Understanding the human behavioral and emotional cues that occur in interaction has become a major research interest due to the emergence of numerous applications, such as social robotics. While there is agreement across different theories that some behavioral signals are involved in communicating information, there is a lack of consensus regarding their specificity, their universality, and whether they convey emotions, affective, cognitive, or mental states, or all of those. Our goal in this study is to explore the relationship between behavioral and emotional cues extracted from speech (e.g., laughter, speech duration, negative emotions) and different communicative information about the human participant. This study is based on a corpus of audio/video data of humorous interactions between the Nao robot and 37 human participants. Participants filled in three questionnaires about their personality, sense of humor, and mental states regarding the interaction. This work reveals the existence of many links between behavioral and emotional cues and the mental states reported by human participants through self-report questionnaires. However, we have not found a clear connection between reported mental states and participants' profiles.
{"title":"Behavioral and Emotional Spoken Cues Related to Mental States in Human-Robot Social Interaction","authors":"Lucile Bechade, G. D. Duplessis, M. A. Sehili, L. Devillers","doi":"10.1145/2818346.2820777","DOIUrl":"https://doi.org/10.1145/2818346.2820777","url":null,"abstract":"Understanding human behavioral and emotional cues occurring in interaction has become a major research interest due to the emergence of numerous applications such as in social robotics. While there is agreement across different theories that some behavioral signals are involved in communicating information, there is a lack of consensus regarding their specificity, their universality, and whether they convey emotions, affective, cognitive, mental states or all of those. Our goal in this study is to explore the relationship between behavioral and emotional cues extracted from speech (e.g., laughter, speech duration, negative emotions) with different communicative information about the human participant. This study is based on a corpus of audio/video data of humorous interactions between the nao{} robot and 37 human participants. Participants filled three questionnaires about their personality, sense of humor and mental states regarding the interaction. This work reveals the existence of many links between behavioral and emotional cues and the mental states reported by human participants through self-report questionnaires. However, we have not found a clear connection between reported mental states and participants profiles.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76776505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. D’Mello, A. Olney, Nathaniel Blanchard, Borhan Samei, Xiaoyi Sun, Brooke Ward, Sean Kelly
We focus on data collection designs for the automated analysis of teacher-student interactions in live classrooms, with the goal of identifying instructional activities (e.g., lecturing, discussion) and assessing the quality of dialogic instruction (e.g., analysis of questions). Our designs were motivated by multiple technical requirements and constraints. Most importantly, teachers could be individually mic'd, but their audio needed to be of excellent quality for automatic speech recognition (ASR) and spoken utterance segmentation. Individual students could not be mic'd, but classroom audio quality only needed to be sufficient to detect student spoken utterances. Visual information could only be recorded if students could not be identified. Design 1 used an omnidirectional laptop microphone to record both teacher and classroom audio and was quickly deemed unsuitable. In Designs 2 and 3, teachers wore a wireless Samson AirLine 77 vocal headset system, a unidirectional microphone with a cardioid pickup pattern. In Design 2, classroom audio was recorded with dual first-generation Microsoft Kinects placed at the front corners of the class. Design 3 used a Crown PZM-30D pressure zone microphone mounted on the blackboard to record classroom audio. Designs 2 and 3 were tested by recording audio in 38 live middle school classrooms from six U.S. schools while trained human coders simultaneously performed live coding of classroom discourse. Qualitative and quantitative analyses revealed that Design 3 was suitable for three of our core tasks: (1) ASR on teacher speech (word recognition rate of 66% and word overlap rate of 69% using the Google Speech ASR engine); (2) teacher utterance segmentation (F-measure of 97%); and (3) student utterance segmentation (F-measure of 66%). Ideas for incorporating video and skeletal tracking with dual second-generation Kinects to produce Design 4 are discussed.
{"title":"Multimodal Capture of Teacher-Student Interactions for Automated Dialogic Analysis in Live Classrooms","authors":"S. D’Mello, A. Olney, Nathaniel Blanchard, Borhan Samei, Xiaoyi Sun, Brooke Ward, Sean Kelly","doi":"10.1145/2818346.2830602","DOIUrl":"https://doi.org/10.1145/2818346.2830602","url":null,"abstract":"We focus on data collection designs for the automated analysis of teacher-student interactions in live classrooms with the goal of identifying instructional activities (e.g., lecturing, discussion) and assessing the quality of dialogic instruction (e.g., analysis of questions). Our designs were motivated by multiple technical requirements and constraints. Most importantly, teachers could be individually micfied but their audio needed to be of excellent quality for automatic speech recognition (ASR) and spoken utterance segmentation. Individual students could not be micfied but classroom audio quality only needed to be sufficient to detect student spoken utterances. Visual information could only be recorded if students could not be identified. Design 1 used an omnidirectional laptop microphone to record both teacher and classroom audio and was quickly deemed unsuitable. In Designs 2 and 3, teachers wore a wireless Samson AirLine 77 vocal headset system, which is a unidirectional microphone with a cardioid pickup pattern. In Design 2, classroom audio was recorded with dual first- generation Microsoft Kinects placed at the front corners of the class. Design 3 used a Crown PZM-30D pressure zone microphone mounted on the blackboard to record classroom audio. Designs 2 and 3 were tested by recording audio in 38 live middle school classrooms from six U.S. schools while trained human coders simultaneously performed live coding of classroom discourse. Qualitative and quantitative analyses revealed that Design 3 was suitable for three of our core tasks: (1) ASR on teacher speech (word recognition rate of 66% and word overlap rate of 69% using Google Speech ASR engine); (2) teacher utterance segmentation (F-measure of 97%); and (3) student utterance segmentation (F-measure of 66%). Ideas to incorporate video and skeletal tracking with dual second-generation Kinects to produce Design 4 are discussed.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72882724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}