LieCatcher: Game Framework for Collecting Human Judgments of Deceptive Speech
Sarah Ita Levitan, James Shin, Ivy Chen, Julia Hirschberg
DOI: 10.1145/3382507.3421166
Humans are notoriously poor at detecting deception: most perform worse than chance. To address this issue we have developed LieCatcher, a single-player web-based Game With A Purpose (GWAP) that allows players to assess their lie detection skills while providing human judgments of deceptive speech. Players listen to audio recordings drawn from a corpus of deceptive and non-deceptive interview dialogues and guess whether the speaker is lying or telling the truth. They are awarded points for correct guesses, and at the end of the game they receive a score summarizing their lie detection performance. We present the game design and implementation, and describe a crowdsourcing experiment conducted to study perceived deception.
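The game loop the abstract describes (guess, earn points for correct guesses, log every judgment for later analysis) is simple enough to sketch. The following is a minimal illustration of that GWAP pattern, not the authors' implementation; the Recording structure, the 10-point reward, and the aggregation helper are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Recording:
    clip_id: str
    is_lie: bool                 # ground-truth label from the interview corpus
    judgments: list = field(default_factory=list)  # crowd judgments gathered so far

def play_session(recordings, guesses, points_per_hit=10):
    """Score one game session while logging every judgment for later analysis."""
    score = 0
    for rec, guess in zip(recordings, guesses):
        rec.judgments.append(guess)    # the "purpose": collect human judgments
        if guess == rec.is_lie:        # the "game": reward correct lie detection
            score += points_per_hit
    return score

def perceived_lie_rate(rec):
    """Fraction of players who judged this clip deceptive."""
    return sum(rec.judgments) / len(rec.judgments) if rec.judgments else 0.0

clips = [Recording("c1", True), Recording("c2", False)]
print(play_session(clips, [True, True]))   # 10: first guess correct, second wrong
print(perceived_lie_rate(clips[1]))        # 1.0: every player judged c2 deceptive
```

The per-clip judgment log is what turns the game into a data collection instrument: aggregated across players, it yields the perceived-deception rates the crowdsourcing experiment studies.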
{"title":"LieCatcher: Game Framework for Collecting Human Judgments of Deceptive Speech","authors":"Sarah Ita Levitan, James Shin, Ivy Chen, Julia Hirschberg","doi":"10.1145/3382507.3421166","DOIUrl":"https://doi.org/10.1145/3382507.3421166","url":null,"abstract":"Humans are notoriously poor at detecting deception --- most are worse than chance. To address this issue we have developed LieCatcher, a single-player web-based Game With A Purpose (GWAP) that allows players to assess their lie detection skills while providing human judgments of deceptive speech. Players listen to audio recordings drawn from a corpus of deceptive and non-deceptive interview dialogues, and guess if the speaker is lying or telling the truth. They are awarded points for correct guesses and at the end of the game they receive a score summarizing their performance at lie detection. We present the game design and implementation, and describe a crowdsourcing experiment conducted to study perceived deception.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129977530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Musical Multimodal Interaction: From Bodies to Ecologies
Atau Tanaka
DOI: 10.1145/3382507.3419444

Musical performance can be thought of in multimodal terms: physical interaction with musical instruments produces sound output, often while the performer visually reads a score. Digital Musical Instrument (DMI) design merges tenets of HCI and musical instrument practice. Audiovisual performance and other forms of multimedia might benefit from multimodal thinking. This keynote revisits two decades of interactive music practice that has paralleled the development of the field of multimodal interaction research. The BioMuse was an early digital musical instrument system using EMG muscle sensing that was extended by a second mode of sensing, allowing effort and position to serve as two complementary modalities [1]. The Haptic Wave applied principles of cross-modal information display to create a haptic audio editor enabling visually impaired audio producers to 'feel' audio waveforms they could not see in a graphical user interface [2]. VJ culture extends the idea of music DJs to create audiovisual cultural experiences. AVUIs were a set of creative coding tools that enabled the convergence of performance UI and creative visual output [3]. The Orchestra of Rocks is a continuing collaboration with visual artist Uta Kogelsberger that has manifested itself through physical and virtual forms, allowing multimodality over time [4]. Be it a physical exhibition in a gallery or audio-reactive 3D animation on YouTube 360, the multiple modes in which an artwork is articulated support its original conceptual foundations. These four projects situate multimodal interaction at the heart of artistic research.
{"title":"Musical Multimodal Interaction: From Bodies to Ecologies","authors":"Atau Tanaka","doi":"10.1145/3382507.3419444","DOIUrl":"https://doi.org/10.1145/3382507.3419444","url":null,"abstract":"Musical performance can be thought of in multimodal terms - physical interaction with musical instruments produces sound output, often while the performer is visually reading a score. Digital Musical Instrument (DMI) design merges tenets of HCI and musical instrument practice. Audiovisual performance and other forms of multimedia might benefit from multimodal thinking. This keynote revisits two decades of interactive music practice that has paralleled the development of the field of multimodal interaction research. The BioMuse was an early digital musical instrument system using EMG muscle sensing that was extended by a second mode of sensing, allowing effort and position to be two complementary modalities [1]. The Haptic Wave applied principles of cross-modal information display to create a haptic audio editor enabling visually impaired audio producers to 'feel' audio waveforms they could not see in a graphical user interface [2]. VJ culture extends the idea of music DJs to create audiovisual cultural experiences. AVUIs were a set of creative coding tools that enabled the convergence of performance UI and creative visual output [3]. The Orchestra of Rocks is a continuing collaboration with visual artist Uta Kogelsberger that has manifested itself through physical and virtual forms - allowing multimodality over time [4]. Be it a physical exhibition in a gallery or audio reactive 3D animation on YouTube 360, the multiple modes in which an artwork is articulated support its original conceptual foundations. These four projects situate multimodal interaction at the heart of artistic research.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116274243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Facilitating Flexible Force Feedback Design with Feelix
Anke van Oosterhout, M. Bruns, Eve E. Hoggan
DOI: 10.1145/3382507.3418819

In the last decade, haptic actuators have improved in quality and efficiency, enabling easier implementation in user interfaces. One of the next steps towards a mature haptics field is a larger and more diverse toolset that enables designers and novices to explore the design and implementation of haptic feedback in their projects. In this paper, we look at several design projects that utilize haptic force feedback to aid interaction between the user and the product. We analysed the process interaction designers went through when developing their haptic user interfaces. Based on our insights, we identified requirements for a haptic force feedback authoring tool. We discuss how these requirements are addressed by 'Feelix', a tool that supports sketching and refinement of haptic force feedback effects.
{"title":"Facilitating Flexible Force Feedback Design with Feelix","authors":"Anke van Oosterhout, M. Bruns, Eve E. Hoggan","doi":"10.1145/3382507.3418819","DOIUrl":"https://doi.org/10.1145/3382507.3418819","url":null,"abstract":"In the last decade, haptic actuators have improved in quality and efficiency, enabling easier implementation in user interfaces. One of the next steps towards a mature haptics field is a larger and more diverse toolset that enables designers and novices to explore with the design and implementation of haptic feedback in their projects. In this paper, we look at several design projects that utilize haptic force feedback to aid interaction between the user and product. We analysed the process interaction designers went through when developing their haptic user interfaces. Based on our insights, we identified requirements for a haptic force feedback authoring tool. We discuss how these requirements are addressed by 'Feelix', a tool that supports sketching and refinement of haptic force feedback effects.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127657497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Punchline Detection using Context-Aware Hierarchical Multimodal Fusion
Akshat Choube, M. Soleymani
DOI: 10.1145/3382507.3418891

Humor has a history as old as humanity. Humor often induces laughter and elicits amusement and engagement. Humorous behavior is manifested in multiple modalities, including language, voice tone, and gestures; thus, automatic understanding of humorous behavior requires multimodal behavior analysis. Humor detection is a well-established problem in Natural Language Processing, but its multimodal analysis is less explored. In this paper, we present a context-aware hierarchical fusion network for multimodal punchline detection. The proposed neural architecture first fuses the modalities two by two and then fuses all three modalities. The network also models the context of the punchline using Gated Recurrent Units. The model's performance is evaluated on the UR-FUNNY database, yielding state-of-the-art performance.
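The hierarchy the abstract outlines (fuse modalities two by two, then fuse all three, with a GRU encoding the punchline's context) could be sketched as below. This is a schematic reconstruction, not the authors' network: the feature dimensions, the concatenate-and-project fusion, and the final classifier are assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Pairwise bimodal fusion, then trimodal fusion, with GRU-encoded context."""
    def __init__(self, d_text=300, d_audio=80, d_video=128, d_hid=128):
        super().__init__()
        # First level: fuse the modalities two by two.
        self.fuse_ta = nn.Linear(d_text + d_audio, d_hid)
        self.fuse_tv = nn.Linear(d_text + d_video, d_hid)
        self.fuse_av = nn.Linear(d_audio + d_video, d_hid)
        # Second level: fuse all three bimodal representations.
        self.fuse_all = nn.Linear(3 * d_hid, d_hid)
        # Context of the punchline (preceding utterances) encoded with a GRU.
        self.context_gru = nn.GRU(d_hid, d_hid, batch_first=True)
        self.classifier = nn.Linear(2 * d_hid, 2)   # punchline vs. not

    def forward(self, t, a, v, context):
        # t/a/v: (batch, d_*) punchline features; context: (batch, steps, d_hid)
        h_ta = torch.relu(self.fuse_ta(torch.cat([t, a], dim=-1)))
        h_tv = torch.relu(self.fuse_tv(torch.cat([t, v], dim=-1)))
        h_av = torch.relu(self.fuse_av(torch.cat([a, v], dim=-1)))
        h_tri = torch.relu(self.fuse_all(torch.cat([h_ta, h_tv, h_av], dim=-1)))
        _, h_ctx = self.context_gru(context)        # final hidden state
        return self.classifier(torch.cat([h_tri, h_ctx.squeeze(0)], dim=-1))
```

Here the context utterances are assumed to be pre-fused to d_hid-dimensional vectors; in the paper the context would presumably pass through the same fusion hierarchy before the GRU.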
{"title":"Punchline Detection using Context-Aware Hierarchical Multimodal Fusion","authors":"Akshat Choube, M. Soleymani","doi":"10.1145/3382507.3418891","DOIUrl":"https://doi.org/10.1145/3382507.3418891","url":null,"abstract":"Humor has a history as old as humanity. Humor often induces laughter and elicits amusement and engagement. Humorous behavior involves behavior manifested in different modalities including language, voice tone, and gestures. Thus, automatic understanding of humorous behavior requires multimodal behavior analysis. Humor detection is a well-established problem in Natural Language Processing but its multimodal analysis is less explored. In this paper, we present a context-aware hierarchical fusion network for multimodal punchline detection. The proposed neural architecture first fuses the modalities two by two and then fuses all three modalities. The network also models the context of the punchline using Gated Recurrent Unit(s). The model's performance is evaluated on UR-FUNNY database yielding state-of-the-art performance.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116166926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Analyzing Nonverbal Behaviors along with Praising
Toshiki Onishi, Arisa Yamauchi, Ryo Ishii, Y. Aono, Akihiro Miyata
DOI: 10.1145/3382507.3418868
In this work, as a first attempt to analyze the relationship between praising skills and human behavior in dialogue, we focus on head and face behavior. We create a new dialogue corpus including face and head behavior information of persons who give praise (praiser) and receive praise (receiver), along with the degree of success of praising (praising score). We also create a machine learning model that uses features related to head and face behavior to estimate the praising score, and we clarify which features of the praiser and receiver are important in estimating it. The analysis results showed that features of both the praiser and receiver are important in estimating the praising score, and that features related to utterance, head, gaze, and chin were important. The analysis of the features of high importance revealed that the praiser and receiver should face each other without turning their heads to the left or right, and that the longer the praiser's utterance, the more successful the praising.
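An estimate-then-inspect analysis of the kind described here could, for instance, use an off-the-shelf regressor that exposes feature importances. A minimal sketch with scikit-learn follows; the feature names and the random stand-in data are hypothetical, not from the authors' corpus.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical per-exchange features for praiser and receiver (names assumed).
feature_names = [
    "praiser_utterance_len", "praiser_head_yaw", "praiser_gaze_at_partner",
    "receiver_head_yaw", "receiver_gaze_at_partner", "receiver_chin_motion",
]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(feature_names)))   # stand-in for extracted features
y = rng.uniform(1, 7, size=200)                  # stand-in praising scores

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")   # which behaviors matter most for the score
```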
{"title":"Analyzing Nonverbal Behaviors along with Praising","authors":"Toshiki Onishi, Arisa Yamauchi, Ryo Ishii, Y. Aono, Akihiro Miyata","doi":"10.1145/3382507.3418868","DOIUrl":"https://doi.org/10.1145/3382507.3418868","url":null,"abstract":"In this work, as a first attempt to analyze the relationship between praising skills and human behavior in dialogue, we focus on head and face behavior. We create a new dialogue corpus including face and head behavior information of persons who give praise (praiser) and receive praise (receiver) and the degree of success of praising (praising score). We also create a machine learning model that uses features related to head and face behavior to estimate praising score, clarify which features of the praiser and receiver are important in estimating praising score. The analysis results showed that features of the praiser and receiver are important in estimating praising score and that features related to utterance, head, gaze, and chin were important. The analysis of the features of high importance revealed that the praiser and receiver should face each other without turning their heads to the left or right, and the longer the praiser's utterance, the more successful the praising.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126012765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Leniency to those who confess?: Predicting the Legal Judgement via Multi-Modal Analysis
Liang Yang, Jingjie Zeng, Tao Peng, Xi Luo, Jinghui Zhang, Hongfei Lin
DOI: 10.1145/3382507.3418893
Legal Judgement Prediction (LJP) is now under the spotlight. It usually consists of multiple sub-tasks, such as penalty prediction (fine and imprisonment) and prediction of the applicable articles of law. Penalty predictions are often closely related to the trial process, especially analysis of the criminal suspect's attitude, which influences the judgment of the presiding judge to some extent. In this paper, we first construct a multi-modal dataset of 517 intentional assault cases, which contains trial information as well as the attitude of the suspect. Then, we explore the relationship between the suspect's attitude and the term of imprisonment. Finally, we use the proposed multi-modal model to predict the suspect's attitude, and compare it with several strong baselines. Our experimental results show that the attitude of the criminal suspect is closely related to penalty prediction, which provides a new perspective for LJP.
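The exploratory step, relating the suspect's attitude to the term of imprisonment, amounts to a correlation analysis. A minimal sketch follows, assuming a hypothetical ordinal encoding of attitude and illustrative sentence lengths (neither taken from the paper's dataset):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical encoding: attitude on an ordinal scale (0 = defiant ... 2 = confessing),
# term of imprisonment in months. Values are illustrative only.
attitude = np.array([2, 2, 1, 0, 1, 2, 0, 0, 1, 2])
term_months = np.array([6, 8, 12, 24, 14, 7, 30, 26, 15, 9])

rho, p = spearmanr(attitude, term_months)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# A negative rho would support "leniency to those who confess".
```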
{"title":"Leniency to those who confess?: Predicting the Legal Judgement via Multi-Modal Analysis","authors":"Liang Yang, Jingjie Zeng, Tao Peng, Xi Luo, Jinghui Zhang, Hongfei Lin","doi":"10.1145/3382507.3418893","DOIUrl":"https://doi.org/10.1145/3382507.3418893","url":null,"abstract":"The Legal Judgement Prediction (LJP) is now under the spotlight. And it usually consists of multiple sub-tasks, such as penalty prediction (fine and imprisonment) and the prediction of articles of law. For penalty prediction, they are often closely related to the trial process, especially the attitude analysis of criminal suspects, which will influence the judgment of the presiding judge to some extent. In this paper, we firstly construct a multi-modal dataset with 517 cases of intentional assault, which contains trial information as well as the attitude of the suspect. Then, we explore the relationship between suspect`s attitude and term of imprisonment. Finally, we use the proposed multi-modal model to predict the suspect's attitude, and compare it with several strong baselines. Our experimental results show that the attitude of the criminal suspect is closely related to the penalty prediction, which provides a new perspective for LJP.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121665215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Zero-Shot Learning for Gesture Recognition
Naveen Madapana
DOI: 10.1145/3382507.3421161

Zero-Shot Learning (ZSL) is a new paradigm in machine learning that aims to recognize classes that are not present in the training data, making it capable of comprehending categories that were never seen before. While deep learning has pushed the limits of unseen object recognition, ZSL for temporal problems such as unfamiliar gesture recognition (referred to as ZSGL) remains unexplored. ZSGL has the potential to yield efficient human-machine interfaces that can recognize and understand the spontaneous and conversational gestures of humans. The objective of this work is to conceptualize, model, and develop a framework to tackle ZSGL problems. The first step in the pipeline is to develop a database of gesture attributes that are representative of a range of categories. Next, a deep architecture consisting of convolutional and recurrent layers is proposed to jointly optimize the semantic and classification losses. Lastly, rigorous experiments are performed to compare the proposed model with existing ZSL models on the CGD 2013 and MSRC-12 datasets. In our preliminary work, we identified a list of 64 discriminative attributes related to gestures' morphological characteristics. Our approach yields an unseen-class accuracy of 41%, which outperforms the state-of-the-art approaches by a considerable margin. Future work involves: (1) modifying the existing architecture to improve ZSL accuracy, (2) augmenting the database of attributes to incorporate semantic properties, (3) addressing the data imbalance inherent to ZSL problems, and (4) expanding this research to other domains such as surgeme and action recognition.
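The joint optimization of semantic and classification losses described above could be sketched as follows. This is an illustrative stand-in, not the authors' architecture: a GRU replaces the convolutional-plus-recurrent front end, and the dimensions and loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZSGLNet(nn.Module):
    """Shared encoder with two heads, trained on a joint semantic + classification loss."""
    def __init__(self, n_seen_classes=20, n_attributes=64, d_feat=256, d_hid=128):
        super().__init__()
        self.encoder = nn.GRU(d_feat, d_hid, batch_first=True)  # stand-in front end
        self.attr_head = nn.Linear(d_hid, n_attributes)   # semantic: gesture attributes
        self.cls_head = nn.Linear(d_hid, n_seen_classes)  # classification: seen classes

    def forward(self, frames):                 # frames: (batch, time, d_feat)
        _, h = self.encoder(frames)            # final hidden state: (1, batch, d_hid)
        h = h.squeeze(0)
        return self.attr_head(h), self.cls_head(h)

def joint_loss(attr_logits, attr_true, cls_logits, cls_true, alpha=0.5):
    """Weighted sum of the semantic (attribute) loss and the classification loss."""
    sem = F.binary_cross_entropy_with_logits(attr_logits, attr_true)
    cls = F.cross_entropy(cls_logits, cls_true)
    return alpha * sem + (1.0 - alpha) * cls
```

At test time, an unseen gesture would be recognized by matching the predicted attribute vector against the attribute signatures of the unseen classes (e.g., by nearest neighbor); that matching step is what makes the semantic head zero-shot-capable.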
{"title":"Zero-Shot Learning for Gesture Recognition","authors":"Naveen Madapana","doi":"10.1145/3382507.3421161","DOIUrl":"https://doi.org/10.1145/3382507.3421161","url":null,"abstract":"Zero-Shot Learning (ZSL) is a new paradigm in machine learning that aims to recognize the classes that are not present in the training data. Hence, this paradigm is capable of comprehending the categories that were never seen before. While deep learning has pushed the limits of unseen object recognition, ZSL for temporal problems such as unfamiliar gesture recognition (referred to as ZSGL) remain unexplored. ZSGL has the potential to result in efficient human-machine interfaces that can recognize and understand the spontaneous and conversational gestures of humans. In this regard, the objective of this work is to conceptualize, model and develop a framework to tackle ZSGL problems. The first step in the pipeline is to develop a database of gesture attributes that are representative of a range of categories. Next, a deep architecture consisting of convolutional and recurrent layers is proposed to jointly optimize the semantic and classification losses. Lastly, rigorous experiments are performed to compare the proposed model with respect to existing ZSL models on CGD 2013 and MSRC-12 datasets. In our preliminary work, we identified a list of 64 discriminative attributes related to gestures' morphological characteristics. Our approach yields an unseen class accuracy of (41%) which outperforms the state-of-the-art approaches by a considerable margin. Future work involves the following: 1. Modifying the existing architecture in order to improve the ZSL accuracy, 2. Augmenting the database of attributes to incorporate semantic properties, 3. Addressing the issue of data imbalance which is inherent to ZSL problems, and 4. Expanding this research to other domains such as surgeme and action recognition.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121710632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Understanding Applicants' Reactions to Asynchronous Video Interviews Through Self-reports and Nonverbal Cues
Skanda Muralidhar, E. Kleinlogel, E. Mayor, Adrian Bangerter, M. S. Mast, D. Gática-Pérez
DOI: 10.1145/3382507.3418869
Asynchronous video interviews (AVIs) are increasingly used by organizations in their hiring process. In this mode of interviewing, applicants record their responses to predefined interview questions using a webcam via an online platform. AVI usage has grown due to employers' perceived benefits in cost and scale. However, little research has been conducted regarding applicants' reactions to these new interview methods. In this work, we investigate applicants' reactions to an AVI platform using self-reported measures previously validated in the psychology literature. We also investigate the connections between these measures and the nonverbal behavior displayed during the interviews. We find that participants who found the platform creepy and had concerns about privacy reported lower interview performance compared to participants who did not have such concerns. We also observe weak correlations between the nonverbal cues displayed and these self-reported measures. Finally, inference experiments achieve overall low performance in explaining applicants' reactions. Overall, our results reveal that participants who are not at ease with AVIs (i.e., those with high creepy-ambiguity scores) might be unfairly penalized, which has implications for improving hiring practices that use AVIs.
{"title":"Understanding Applicants' Reactions to Asynchronous Video Interviews Through Self-reports and Nonverbal Cues","authors":"Skanda Muralidhar, E. Kleinlogel, E. Mayor, Adrian Bangerter, M. S. Mast, D. Gática-Pérez","doi":"10.1145/3382507.3418869","DOIUrl":"https://doi.org/10.1145/3382507.3418869","url":null,"abstract":"Asynchronous video interviews (AVIs) are increasingly used by organizations in their hiring process. In this mode of interviewing, the applicants are asked to record their responses to predefined interview questions using a webcam via an online platform. AVIs have increased usage due to employers' perceived benefits in terms of costs and scale. However, little research has been conducted regarding applicants' reactions to these new interview methods. In this work, we investigate applicants' reactions to an AVI platform using self-reported measures previously validated in psychology literature. We also investigate the connections of these measures with nonverbal behavior displayed during the interviews. We find that participants who found the platform creepy and had concerns about privacy reported lower interview performance compared to participants who did not have such concerns. We also observe weak correlations between nonverbal cues displayed and these self-reported measures. Finally, inference experiments achieve overall low-performance w.r.t. to explaining applicants' reactions. Overall, our results reveal that participants who are not at ease with AVIs (i.e., high creepy ambiguity score) might be unfairly penalized. This has implications for improved hiring practices using AVIs.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131770988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition
Yufeng Yin, Baiyu Huang, Yizhen Wu, M. Soleymani
DOI: 10.1145/3382507.3418813

Automatic emotion recognition methods are sensitive to variations across datasets, and their performance drops when evaluated across corpora. Domain adaptation techniques, e.g., the Domain-Adversarial Neural Network (DANN), can mitigate this problem. Though the DANN can detect and remove the bias between corpora, the bias between speakers remains, which reduces performance. In this paper, we propose the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to reduce both the domain bias and the speaker bias. Specifically, building on the DANN, we add a speaker discriminator with a gradient reversal layer (GRL) to unlearn information representing speakers' individual characteristics. Our experiments with multimodal data (speech, vision, and text) and cross-domain evaluation indicate that the proposed SIDANN outperforms the DANN (+5.6% and +2.8% on average for detecting arousal and valence, respectively), suggesting that the SIDANN has a better domain adaptation ability. In addition, the modality contribution analysis shows that acoustic features are the most informative for arousal detection, while lexical features perform best for valence detection.
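The gradient reversal layer at the core of this design is a standard construct: identity on the forward pass, negated (and scaled) gradients on the backward pass, so the shared encoder is pushed to discard whatever the discriminator can exploit. A minimal PyTorch sketch of a GRL-equipped speaker discriminator follows; the feature dimension and speaker count are assumptions, and this is not the authors' exact SIDANN code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # no gradient for lambd

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class SpeakerDiscriminator(nn.Module):
    """Trained to identify the speaker; the GRL makes the encoder unlearn identity."""
    def __init__(self, d_feat=256, n_speakers=50):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_feat, 128), nn.ReLU(),
                                 nn.Linear(128, n_speakers))

    def forward(self, features, lambd=1.0):
        return self.net(grad_reverse(features, lambd))
```

The discriminator itself still minimizes its own speaker-classification loss; the reversal only flips the sign of the gradient flowing back into the shared encoder, which is what drives the learned features toward speaker invariance.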
{"title":"Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition","authors":"Yufeng Yin, Baiyu Huang, Yizhen Wu, M. Soleymani","doi":"10.1145/3382507.3418813","DOIUrl":"https://doi.org/10.1145/3382507.3418813","url":null,"abstract":"Automatic emotion recognition methods are sensitive to the variations across different datasets and their performance drops when evaluated across corpora. We can apply domain adaptation techniques e.g., Domain-Adversarial Neural Network (DANN) to mitigate this problem. Though the DANN can detect and remove the bias between corpora, the bias between speakers still remains which results in reduced performance. In this paper, we propose Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to reduce both the domain bias and the speaker bias. Specifically, based on the DANN, we add a speaker discriminator to unlearn information representing speakers' individual characteristics with a gradient reversal layer (GRL). Our experiments with multimodal data (speech, vision, and text) and the cross-domain evaluation indicate that the proposed SIDANN outperforms (+5.6% and +2.8% on average for detecting arousal and valence) the DANN model, suggesting that the SIDANN has a better domain adaptation ability than the DANN. Besides, the modality contribution analysis shows that the acoustic features are the most informative for arousal detection while the lexical features perform the best for valence detection.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132396449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

How Good is Good Enough?: The Impact of Errors in Single Person Action Classification on the Modeling of Group Interactions in Volleyball
Lian Beenhakker, F. Salim, D. Postma, R. V. Delden, D. Reidsma, B. Beijnum
DOI: 10.1145/3382507.3418846
In Human Behaviour Understanding, social interaction is often modeled on the basis of lower-level action recognition. The accuracy of this recognition affects the system's capability to detect higher-level social events, and thus the usefulness of the resulting system. We model team interactions in volleyball and investigate, through simulation of typical error patterns, how one can determine the required quality (in accuracy and in allowable types of errors) of the underlying action recognition for automated volleyball monitoring. Our approach simulates different patterns of errors, grounded in related work on volleyball action recognition, on top of a manually annotated ground truth, to model their different impacts on interaction recognition. Our results show that this provides a means to quantify the effect of different types of classification errors on the overall quality of the system. Our volleyball use case, in the rising field of sports monitoring, also addresses team-specific challenges in such a system and shows how these can be visualized to grasp the interdependencies. In our use case, the first layer of the system classifies actions of individual players and the second layer recognizes multiplayer exercises and complexes (i.e., sequences in rallies) to enhance training. The experiments performed for this study investigated how errors at the action recognition layer propagate and cause errors at the complexes layer. We discuss the strengths and weaknesses of the layered system for modeling volleyball rallies, indicate which kinds of errors cause more problems, and outline the design choices that follow from them. In our given context, we suggest that for recognition of non-Freeball actions (e.g., smash, block) it is more important to achieve high accuracy, which can be done at the cost of accuracy on Freeball actions (which are mostly plays between team members and are more interchangeable in their role in the complexes).
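The core simulation idea (corrupt a manually annotated action sequence with a controlled error pattern, then measure how the errors propagate to the complexes layer) can be sketched briefly. The action vocabulary and error probabilities below are hypothetical, not the paper's experimental settings.

```python
import random

# Hypothetical action vocabulary; which actions count as "Freeball" is assumed.
ACTIONS = ["freeball", "smash", "block", "serve", "reception"]

def simulate_errors(ground_truth, confusion_prob):
    """Corrupt an annotated action sequence with a given per-class error pattern.

    confusion_prob maps an action to the probability that the (simulated)
    recognizer misclassifies it; wrong labels are drawn uniformly from the rest.
    """
    corrupted = []
    for action in ground_truth:
        if random.random() < confusion_prob.get(action, 0.0):
            corrupted.append(random.choice([a for a in ACTIONS if a != action]))
        else:
            corrupted.append(action)
    return corrupted

# Example: tolerate more errors on freeballs, fewer on smashes and blocks.
random.seed(0)
rally = ["serve", "reception", "freeball", "smash", "block", "freeball"]
noisy = simulate_errors(rally, {"freeball": 0.3, "smash": 0.05, "block": 0.05})
print(noisy)  # feed into the complex-recognition layer to measure propagation
```

Sweeping the per-class probabilities and comparing complex-level recognition on clean versus corrupted sequences is what yields the "how good is good enough" trade-off curves the paper discusses.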
{"title":"How Good is Good Enough?: The Impact of Errors in Single Person Action Classification on the Modeling of Group Interactions in Volleyball","authors":"Lian Beenhakker, F. Salim, D. Postma, R. V. Delden, D. Reidsma, B. Beijnum","doi":"10.1145/3382507.3418846","DOIUrl":"https://doi.org/10.1145/3382507.3418846","url":null,"abstract":"In Human Behaviour Understanding, social interaction is often modeled on the basis of lower level action recognition. The accuracy of this recognition has an impact on the system's capability to detect the higher level social events, and thus on the usefulness of the resulting system. We model team interactions in volleyball and investigate, through simulation of typical error patterns, how one can consider the required quality (in accuracy and in allowable types of errors) of the underlying action recognition for automated volleyball monitoring. Our proposed approach simulates different patterns of errors, grounded in related work in volleyball action recognition, on top of a manually annotated ground truth to model their different impact on the interaction recognition. Our results show that this can provide a means to quantify the effect of different type of classification errors on the overall quality of the system. Our chosen volleyball use case, in the rising field of sports monitoring, also addresses specific team related challenges in such a system and how these can be visualized to grasp the interdependencies. In our use case the first layer of our system classifies actions of individual players and the second layer recognizes multiplayer exercises and complexes (i.e. sequences in rallies) to enhance training. The experiments performed for this study investigated how errors at the action recognition layer propagate and cause errors at the complexes layer. We discuss the strengths and weaknesses of the layered system to model volleyball rallies. We also give indications regarding what kind of errors are causing more problems and what choices can follow from them. In our given context we suggest that for recognition of non-Freeball actions (e.g. smash, block) it is more important to achieve a higher accuracy, which can be done at the cost of accuracy of classification of Freeball actions (which are mostly plays between team members and are more interchangable as to their role in the complexes).","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"260 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133581966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}