A Visual Analytics Approach to Finding Factors Improving Automatic Speaker Identifications
P. Bruneau, M. Stefas, H. Bredin, Johann Poignant, T. Tamisier, C. Barras
Classification quality criteria such as precision, recall, and F-measure are generally the basis for evaluating contributions in automatic speaker recognition. Specifically, comparisons are mostly carried out via mean values estimated over a set of media. While this approach is relevant for assessing improvement with respect to the state of the art, or for ranking participants in an automatic annotation challenge, it gives system designers little insight into how to improve algorithms, formulate hypotheses, and display evidence. This paper presents a design study of a visual and interactive approach to analyzing errors made by automatic annotation algorithms. A timeline-based tool emerged from earlier steps of this study. A critical review, driven by user interviews, exposes its caveats and refines the user objectives. The next step of the study is then initiated by sketching designs that combine elements of the current prototype with principles newly identified as relevant.
{"title":"A Visual Analytics Approach to Finding Factors Improving Automatic Speaker Identifications","authors":"P. Bruneau, M. Stefas, H. Bredin, Johann Poignant, T. Tamisier, C. Barras","doi":"10.1145/2818346.2820769","DOIUrl":"https://doi.org/10.1145/2818346.2820769","url":null,"abstract":"Classification quality criteria such as precision, recall, and F-measure are generally the basis for evaluating contributions in automatic speaker recognition. Specifically, comparisons are carried out mostly via mean values estimated on a set of media. Whilst this approach is relevant to assess improvement w.r.t. the state-of-the-art, or ranking participants in the context of an automatic annotation challenge, it gives little insight to system designers in terms of cues for improving algorithms, hypothesis formulation, and evidence display. This paper presents a design study of a visual and interactive approach to analyze errors made by automatic annotation algorithms. A timeline-based tool emerged from prior steps of this study. A critical review, driven by user interviews, exposes caveats and refines user objectives. The next step of the study is then initiated by sketching designs combining elements of the current prototype to principles newly identified as relevant.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82363880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gaze+Gesture: Expressive, Precise and Targeted Free-Space Interactions
Ishan Chatterjee, R. Xiao, Chris Harrison
Humans rely on eye gaze and hand manipulations extensively in their everyday activities. Most often, users gaze at an object to perceive it and then use their hands to manipulate it. We propose applying a multimodal, gaze plus free-space gesture approach to enable rapid, precise and expressive touch-free interactions. We show the input methods are highly complementary, mitigating issues of imprecision and limited expressivity in gaze-alone systems, and issues of targeting speed in gesture-alone systems. We extend an existing interaction taxonomy that naturally divides the gaze+gesture interaction space, which we then populate with a series of example interaction techniques to illustrate the character and utility of each method. We contextualize these interaction techniques in three example scenarios. In our user study, we pit our approach against five contemporary approaches; results show that gaze+gesture can outperform systems using gaze or gesture alone, and in general, approach the performance of "gold standard" input systems, such as the mouse and trackpad.
{"title":"Gaze+Gesture: Expressive, Precise and Targeted Free-Space Interactions","authors":"Ishan Chatterjee, R. Xiao, Chris Harrison","doi":"10.1145/2818346.2820752","DOIUrl":"https://doi.org/10.1145/2818346.2820752","url":null,"abstract":"Humans rely on eye gaze and hand manipulations extensively in their everyday activities. Most often, users gaze at an object to perceive it and then use their hands to manipulate it. We propose applying a multimodal, gaze plus free-space gesture approach to enable rapid, precise and expressive touch-free interactions. We show the input methods are highly complementary, mitigating issues of imprecision and limited expressivity in gaze-alone systems, and issues of targeting speed in gesture-alone systems. We extend an existing interaction taxonomy that naturally divides the gaze+gesture interaction space, which we then populate with a series of example interaction techniques to illustrate the character and utility of each method. We contextualize these interaction techniques in three example scenarios. In our user study, we pit our approach against five contemporary approaches; results show that gaze+gesture can outperform systems using gaze or gesture alone, and in general, approach the performance of \"gold standard\" input systems, such as the mouse and trackpad.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82494451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Multimodal System for Real-Time Action Instruction in Motor Skill Learning
I. D. Kok, J. Hough, Felix Hülsmann, M. Botsch, David Schlangen, S. Kopp
We present a multimodal coaching system that supports online motor skill learning. In this domain, closed-loop interaction between the movements of the user and the action instructions by the system is an essential requirement. To achieve this, the actions of the user need to be measured and evaluated and the system must be able to give corrective instructions on the ongoing performance. Timely delivery of these instructions, particularly during execution of the motor skill by the user, is thus of the highest importance. Based on the results of an empirical study on motor skill coaching, we analyze the requirements for an interactive coaching system and present an architecture that combines motion analysis, dialogue management, and virtual human animation in a motion tracking and 3D virtual reality hardware setup. In a preliminary study we demonstrate that the current system is capable of delivering the closed-loop interaction that is required in the motor skill learning domain.
{"title":"A Multimodal System for Real-Time Action Instruction in Motor Skill Learning","authors":"I. D. Kok, J. Hough, Felix Hülsmann, M. Botsch, David Schlangen, S. Kopp","doi":"10.1145/2818346.2820746","DOIUrl":"https://doi.org/10.1145/2818346.2820746","url":null,"abstract":"We present a multimodal coaching system that supports online motor skill learning. In this domain, closed-loop interaction between the movements of the user and the action instructions by the system is an essential requirement. To achieve this, the actions of the user need to be measured and evaluated and the system must be able to give corrective instructions on the ongoing performance. Timely delivery of these instructions, particularly during execution of the motor skill by the user, is thus of the highest importance. Based on the results of an empirical study on motor skill coaching, we analyze the requirements for an interactive coaching system and present an architecture that combines motion analysis, dialogue management, and virtual human animation in a motion tracking and 3D virtual reality hardware setup. In a preliminary study we demonstrate that the current system is capable of delivering the closed-loop interaction that is required in the motor skill learning domain.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90107278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Computational Model of Culture-Specific Emotion Detection for Artificial Agents in the Learning Domain
Ganapreeta Renunathan Naidu
Nowadays, intelligent agents are expected to be affect-sensitive, as agents are becoming essential entities that support computer-mediated tasks, especially in teaching and training. These agents use common natural modalities, such as facial expressions, gestures, and eye gaze, to recognize a user's affective state and respond accordingly. However, these nonverbal cues may not be universal, as emotion recognition and expression differ from culture to culture. It is important that intelligent interfaces are equipped with the ability to meet the challenge of cultural diversity to facilitate human-machine interaction, particularly in Asia. Asians are known to be more passive and to possess certain traits such as indirectness and non-confrontationalism, which lead to emotions such as (culture-specific forms of) shyness and timidity. Therefore, a model based on another culture may not be applicable in an Asian setting, ruling out a one-size-fits-all approach. This study is initiated to identify the discriminative markers of culture-specific emotions based on multimodal interactions.
{"title":"A Computational Model of Culture-Specific Emotion Detection for Artificial Agents in the Learning Domain","authors":"Ganapreeta Renunathan Naidu","doi":"10.1145/2818346.2823307","DOIUrl":"https://doi.org/10.1145/2818346.2823307","url":null,"abstract":"Nowadays, intelligent agents are expected to be affect-sensitive as agents are becoming essential entities that supports computer-mediated tasks, especially in teaching and training. These agents use common natural modalities-such as facial expressions, gestures and eye gaze in order to recognize a user's affective state and respond accordingly. However, these nonverbal cues may not be universal as emotion recognition and expression differ from culture to culture. It is important that intelligent interfaces are equipped with the abilities to meet the challenge of cultural diversity to facilitate human-machine interaction particularly in Asia. Asians are known to be more passive and possess certain traits such as indirectness and non-confrontationalism, which lead to emotions such as (culture-specific form of) shyness and timidity. Therefore, a model based on other culture may not be applicable in an Asian setting, out-rulling a one-size-fits-all approach. This study is initiated to identify the discriminative markers of culture-specific emotions based on the multimodal interactions.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75694221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Providing Real-time Feedback for Student Teachers in a Virtual Rehearsal Environment
R. Barmaki, C. Hughes
Research in learning analytics and educational data mining has recently become prominent in the fields of computer science and education. Most scholars in the field emphasize student learning and student data analytics; however, it is also important to focus on teaching analytics and teacher preparation because of their key roles in student learning, especially in K-12 learning environments. Nonverbal communication strategies play an important role in teachers' successful interpersonal communication with their students. In order to assist novice or practicing teachers with exhibiting open and affirmative nonverbal cues in their classrooms, we have designed a multimodal teaching platform with provisions for online feedback. We used an interactive teaching rehearsal software, TeachLivE, as our basic research environment. TeachLivE employs a digital puppetry paradigm as its core technology. Individuals walk into this virtual environment and interact with virtual students displayed on a large screen. They can practice classroom management, pedagogy and content delivery skills with a teaching plan in the TeachLivE environment. We have designed an experiment to evaluate the impact of an online nonverbal feedback application. In this experiment, different types of multimodal data were collected during two experimental settings. These data include the talk time and nonverbal behaviors of the virtual students, captured in log files; the talk time and full-body tracking data of the participant; and video recordings of the virtual classroom with the participant. Thirty-four student teachers participated in this 30-minute experiment. In each of the settings, the participants were provided with teaching plans from which they taught. All the participants took part in both experimental settings. In order to have a balanced experimental design, half of the participants received nonverbal online feedback in their first session and the other half received this feedback in the second session. A visual indication was used for feedback each time the participant exhibited a closed, defensive posture. Based on the recorded full-body tracking data, we observed that only those who received feedback in their first session demonstrated a significant number of open postures in the session containing no feedback. However, the post-questionnaire responses indicated that all participants were more mindful of their body postures while teaching after they had participated in the study.
{"title":"Providing Real-time Feedback for Student Teachers in a Virtual Rehearsal Environment","authors":"R. Barmaki, C. Hughes","doi":"10.1145/2818346.2830604","DOIUrl":"https://doi.org/10.1145/2818346.2830604","url":null,"abstract":"Research in learning analytics and educational data mining has recently become prominent in the fields of computer science and education. Most scholars in the field emphasize student learning and student data analytics; however, it is also important to focus on teaching analytics and teacher preparation because of their key roles in student learning, especially in K-12 learning environments. Nonverbal communication strategies play an important role in successful interpersonal communication of teachers with their students. In order to assist novice or practicing teachers with exhibiting open and affirmative nonverbal cues in their classrooms, we have designed a multimodal teaching platform with provisions for online feedback. We used an interactive teaching rehearsal software, TeachLivE, as our basic research environment. TeachLivE employs a digital puppetry paradigm as its core technology. Individuals walk into this virtual environment and interact with virtual students displayed on a large screen. They can practice classroom management, pedagogy and content delivery skills with a teaching plan in the TeachLivE environment. We have designed an experiment to evaluate the impact of an online nonverbal feedback application. In this experiment, different types of multimodal data have been collected during two experimental settings. These data include talk-time and nonverbal behaviors of the virtual students, captured in log files; talk time and full body tracking data of the participant; and video recording of the virtual classroom with the participant. 34 student teachers participated in this 30-minute experiment. In each of the settings, the participants were provided with teaching plans from which they taught. All the participants took part in both of the experimental settings. In order to have a balanced experiment design, half of the participants received nonverbal online feedback in their first session and the other half received this feedback in the second session. A visual indication was used for feedback each time the participant exhibited a closed, defensive posture. Based on recorded full-body tracking data, we observed that only those who received feedback in their first session demonstrated a significant number of open postures in the session containing no feedback. However, the post-questionnaire information indicated that all participants were more mindful of their body postures while teaching after they had participated in the study.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76959158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spoken Interruptions Signal Productive Problem Solving and Domain Expertise in Mathematics
S. Oviatt, Kevin Hang, Jianlong Zhou, Fang Chen
Prevailing social norms prohibit interrupting another person when they are speaking. In this research, simultaneous speech was investigated in groups of students as they jointly solved math problems and peer tutored one another. Analyses were based on the Math Data Corpus, which includes ground-truth performance coding and speech transcriptions. Simultaneous speech was elevated 120-143% during the most productive phase of problem solving, compared with matched intervals. It also was elevated 18-37% in students who were domain experts, compared with non-experts. Qualitative analyses revealed that experts differed from non-experts in the function of their interruptions. Analysis of these functional asymmetries produced nine key behaviors that were used to identify the dominant math expert in a group with 95-100% accuracy in three minutes. This research demonstrates that overlapped speech is a marker of group problem-solving progress and domain expertise. It provides valuable information for the emerging field of learning analytics.
{"title":"Spoken Interruptions Signal Productive Problem Solving and Domain Expertise in Mathematics","authors":"S. Oviatt, Kevin Hang, Jianlong Zhou, Fang Chen","doi":"10.1145/2818346.2820743","DOIUrl":"https://doi.org/10.1145/2818346.2820743","url":null,"abstract":"Prevailing social norms prohibit interrupting another person when they are speaking. In this research, simultaneous speech was investigated in groups of students as they jointly solved math problems and peer tutored one another. Analyses were based on the Math Data Corpus, which includes ground-truth performance coding and speech transcriptions. Simultaneous speech was elevated 120-143% during the most productive phase of problem solving, compared with matched intervals. It also was elevated 18-37% in students who were domain experts, compared with non-experts. Qualitative analyses revealed that experts differed from non-experts in the function of their interruptions. Analysis of these functional asymmetries produced nine key behaviors that were used to identify the dominant math expert in a group with 95-100% accuracy in three minutes. This research demonstrates that overlapped speech is a marker of group problem-solving progress and domain expertise. It provides valuable information for the emerging field of learning analytics.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"29 3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79862327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Selfies: Designing a Multimodal Recording Device for Students in Traditional Classrooms
Federico Domínguez, K. Chiluiza, Vanessa Echeverría, X. Ochoa
The traditional recording of student interaction in classrooms has raised privacy concerns among both students and academics. However, the same students are happy to share their daily lives through social media. Perception of data ownership is the key factor in this paradox. This article proposes the design of a personal Multimodal Recording Device (MRD) that could capture the actions of its owner during lectures. The MRD would be able to capture close-range video, audio, writing, and other environmental signals. Unlike traditional centralized recording systems, students would have control over their own recorded data. They could decide to share their information in exchange for access to the recordings of the instructor, notes from their classmates, and analyses of, for example, their attention performance. By sharing their data, students participate in the co-creation of enhanced and synchronized course notes that will benefit all the participating students. This work presents details about how such a device could be built from available components. It also discusses and evaluates the design of such a device, including its foreseeable costs, scalability, flexibility, intrusiveness, and recording quality.
{"title":"Multimodal Selfies: Designing a Multimodal Recording Device for Students in Traditional Classrooms","authors":"Federico Domínguez, K. Chiluiza, Vanessa Echeverría, X. Ochoa","doi":"10.1145/2818346.2830606","DOIUrl":"https://doi.org/10.1145/2818346.2830606","url":null,"abstract":"The traditional recording of student interaction in classrooms has raised privacy concerns in both students and academics. However, the same students are happy to share their daily lives through social media. Perception of data ownership is the key factor in this paradox. This article proposes the design of a personal Multimodal Recording Device (MRD) that could capture the actions of its owner during lectures. The MRD would be able to capture close-range video, audio, writing, and other environmental signals. Differently from traditional centralized recording systems, students would have control over their own recorded data. They could decide to share their information in exchange of access to the recordings of the instructor, notes form their classmates, and analysis of, for example, their attention performance. By sharing their data, students participate in the co-creation of enhanced and synchronized course notes that will benefit all the participating students. This work presents details about how such a device could be build from available components. This work also discusses and evaluates the design of such device, including its foreseeable costs, scalability, flexibility, intrusiveness and recording quality.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75843857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Digital Flavor: Towards Digitally Simulating Virtual Flavors
Nimesha Ranasinghe, Gajan Suthokumar, Kuan-Yi Lee, E. Do
Flavor is often a pleasurable sensory perception that we experience daily while eating and drinking. However, the sensation of flavor is rarely considered in the age of digital communication, mainly because flavors are not available as a digitally controllable medium. This paper introduces a digital instrument (the Digital Flavor Synthesizing device), which actuates taste sensations (through electrical and thermal stimulation) and smell sensations (through controlled scent emission) together to simulate different flavors digitally. A preliminary user experiment was conducted to study the effectiveness of this method with five predefined flavor stimuli. Experimental results show that users were able to effectively identify different flavors such as minty, spicy, and lemony. Moreover, we outline several challenges ahead along with future possibilities of this technology. In summary, our work demonstrates a novel controllable instrument for flavor simulation, which will be valuable in multimodal interactive systems for rendering virtual flavors digitally.
{"title":"Digital Flavor: Towards Digitally Simulating Virtual Flavors","authors":"Nimesha Ranasinghe, Gajan Suthokumar, Kuan-Yi Lee, E. Do","doi":"10.1145/2818346.2820761","DOIUrl":"https://doi.org/10.1145/2818346.2820761","url":null,"abstract":"Flavor is often a pleasurable sensory perception we experience daily while eating and drinking. However, the sensation of flavor is rarely considered in the age of digital communication mainly due to the unavailability of flavors as a digitally controllable media. This paper introduces a digital instrument (Digital Flavor Synthesizing device), which actuates taste (electrical and thermal stimulation) and smell sensations (controlled scent emitting) together to simulate different flavors digitally. A preliminary user experiment is conducted to study the effectiveness of this method with predefined five different flavor stimuli. Experimental results show that the users were effectively able to identify different flavors such as minty, spicy, and lemony. Moreover, we outline several challenges ahead along with future possibilities of this technology. In summary, our work demonstrates a novel controllable instrument for flavor simulation, which will be valuable in multimodal interactive systems for rendering virtual flavors digitally.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89518836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Demonstrations","authors":"Stefan Scherer","doi":"10.1145/3252453","DOIUrl":"https://doi.org/10.1145/3252453","url":null,"abstract":"","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74037600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits
Moitreya Chatterjee, Sunghyun Park, Louis-Philippe Morency, Stefan Scherer
Human communication involves conveying messages through both verbal and non-verbal channels (facial expression, gestures, prosody, etc.). Nonetheless, the task of learning these patterns for a computer by combining cues from multiple modalities is challenging because it requires an effective representation of the signals and consideration of the complex interactions between them. From the machine learning perspective, this presents a two-fold challenge: a) modeling the intermodal variations and dependencies; b) representing the data using an apt number of features, such that the necessary patterns are captured while allaying concerns such as over-fitting. In this work we attempt to address these aspects of multimodal recognition, in the context of recognizing two essential speaker traits, namely the passion and credibility of online movie reviewers. We propose a novel ensemble classification approach that combines two different perspectives on classifying multimodal data. Each of these perspectives attempts to independently address the two-fold challenge. In the first, we combine the features from multiple modalities but assume inter-modality conditional independence. In the second, we explicitly capture the correlation between the modalities, but in a space of few dimensions, and explore a novel clustering-based kernel similarity approach for recognition. Additionally, this work investigates a recent technique for encoding text data that captures semantic similarity of verbal content and preserves word ordering. The experimental results on a recent public dataset show a significant improvement of our approach over multiple baselines. Finally, we also analyze the most discriminative elements of a speaker's non-verbal behavior that contribute to his/her perceived credibility and passion.
{"title":"Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits","authors":"Moitreya Chatterjee, Sunghyun Park, Louis-Philippe Morency, Stefan Scherer","doi":"10.1145/2818346.2820747","DOIUrl":"https://doi.org/10.1145/2818346.2820747","url":null,"abstract":"Human communication involves conveying messages both through verbal and non-verbal channels (facial expression, gestures, prosody, etc.). Nonetheless, the task of learning these patterns for a computer by combining cues from multiple modalities is challenging because it requires effective representation of the signals and also taking into consideration the complex interactions between them. From the machine learning perspective this presents a two-fold challenge: a) Modeling the intermodal variations and dependencies; b) Representing the data using an apt number of features, such that the necessary patterns are captured but at the same time allaying concerns such as over-fitting. In this work we attempt to address these aspects of multimodal recognition, in the context of recognizing two essential speaker traits, namely passion and credibility of online movie reviewers. We propose a novel ensemble classification approach that combines two different perspectives on classifying multimodal data. Each of these perspectives attempts to independently address the two-fold challenge. In the first, we combine the features from multiple modalities but assume inter-modality conditional independence. In the other one, we explicitly capture the correlation between the modalities but in a space of few dimensions and explore a novel clustering based kernel similarity approach for recognition. Additionally, this work investigates a recent technique for encoding text data that captures semantic similarity of verbal content and preserves word-ordering. The experimental results on a recent public dataset shows significant improvement of our approach over multiple baselines. Finally, we also analyze the most discriminative elements of a speaker's non-verbal behavior that contribute to his/her perceived credibility/passionateness.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74390640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}