Video-Based Engagement Estimation of Game Streamers: An Interpretable Multimodal Neural Network Approach
Sicheng Pan, Gary J.W. Xu, Kun Guo, Seop Hyeong Park, Hongliang Ding
IEEE Transactions on Games, published 2023-12-29. DOI: 10.1109/tg.2023.3348230
Abstract
In this paper, we propose a non-intrusive and non-restrictive multimodal deep learning model for estimating the engagement levels of game streamers. The model is trained on three modalities extracted from the streamers' videos: facial, pixel, and audio information. Additionally, we introduce a novel interpretation technique that directly calculates each modality's contribution to the model's classification performance without the need to retrain single-modality models. Experimental results demonstrate that our model achieves an accuracy of 77.2% on the test set, with the audio modality identified as a key modality for engagement estimation. Using the proposed interpretation technique, we further analyze the model's modality contributions when handling different categories and samples from different players. This enhances the model's interpretability and reveals its limitations, as well as directions for future improvement. The proposed approach and findings have potential applications in game streaming and audience analysis, as well as in domains related to multimodal learning and affective computing.
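To make the two ideas in the abstract concrete, the sketch below shows (a) a three-branch late-fusion classifier over facial, pixel, and audio features and (b) a masking-based probe that scores a modality by the accuracy drop when its features are zeroed at inference time, with no retraining. This is only a minimal illustration: the paper's exact architecture, feature dimensions, and interpretation formula are not given in the abstract, so the PyTorch module names, layer sizes, the `mask` mechanism, and the assumed `(face, pixel, audio, label)` loader format are all hypothetical.

```python
# Minimal sketch, assuming PyTorch and precomputed per-modality feature
# vectors; all dimensions and the masking scheme are illustrative guesses.
import torch
import torch.nn as nn


class MultimodalEngagementNet(nn.Module):
    """Late-fusion classifier over facial, pixel (frame), and audio features."""

    def __init__(self, face_dim=128, pixel_dim=256, audio_dim=64,
                 hidden=128, n_classes=3):
        super().__init__()
        # One small encoder per modality (stand-ins for the real backbones).
        self.face_enc = nn.Sequential(nn.Linear(face_dim, hidden), nn.ReLU())
        self.pixel_enc = nn.Sequential(nn.Linear(pixel_dim, hidden), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Concatenated (fused) representation -> engagement-class logits.
        self.head = nn.Linear(3 * hidden, n_classes)

    def forward(self, face, pixel, audio, mask=(1.0, 1.0, 1.0)):
        # `mask` zeroes out a modality at inference time without retraining,
        # one plausible reading of the abstract's interpretation technique.
        z = torch.cat([mask[0] * self.face_enc(face),
                       mask[1] * self.pixel_enc(pixel),
                       mask[2] * self.audio_enc(audio)], dim=-1)
        return self.head(z)


@torch.no_grad()
def modality_contribution(model, loader, modality_idx):
    """Score one modality by the test-accuracy drop when it is masked out."""

    def accuracy(mask):
        correct = total = 0
        # Assumed loader format: batches of (face, pixel, audio, label).
        for face, pixel, audio, y in loader:
            pred = model(face, pixel, audio, mask=mask).argmax(dim=-1)
            correct += (pred == y).sum().item()
            total += y.numel()
        return correct / total

    full = accuracy((1.0, 1.0, 1.0))
    mask = [1.0, 1.0, 1.0]
    mask[modality_idx] = 0.0  # 0 = face, 1 = pixel, 2 = audio
    return full - accuracy(tuple(mask))  # larger drop => more important modality
```

Under this reading, reporting audio as the key modality would correspond to `modality_contribution(model, test_loader, 2)` yielding the largest drop of the three; the probe also extends naturally to per-category or per-player subsets of the test set, as the paper's analysis does.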