Attitude recognition of video bloggers using audio-visual descriptors

F. Haider, L. Cerrato, S. Luz, N. Campbell
{"title":"使用视听描述符的视频博主态度识别","authors":"F. Haider, L. Cerrato, S. Luz, N. Campbell","doi":"10.1145/3011263.3011270","DOIUrl":null,"url":null,"abstract":"In social media, vlogs (video blogs) are a form of unidirectional communication, where the vloggers (video bloggers) convey their messages (opinions, thoughts, etc.) to a potential audience which cannot give them feedback in real time. In this kind of communication, the non-verbal behaviour and personality impression of a video blogger tends to influence viewers' attention because non-verbal cues are correlated with the messages conveyed by a vlogger. In this study, we use the acoustic and visual features (body movements that are captured by low-level visual descriptors) to predict the six different attitudes (amusement, enthusiasm, friendliness, frustration, impatience and neutral) annotated in the speech of 10 video bloggers. The automatic detection of attitude can be helpful in a scenario where a machine has to automatically provide feedback to bloggers about their performance in terms of the extent to which they manage to engage the audience by displaying certain attitudes. Attitude recognition models are trained using the random forest classifier. Results show that: 1) acoustic features provide better accuracy than the visual features, 2) while fusion of audio and visual features does not increase overall accuracy, it improves the results for some attitudes and subjects, and 3) densely extracted histograms of flow provide better results than other visual descriptors. A three-class (positive, negative and neutral attitudes) problem has also been defined. 
Results for this setting show that feature fusion degrades overall classifier accuracy, and the classifiers perform better on the original six-class problem than on the three-class setting.","PeriodicalId":272696,"journal":{"name":"Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Attitude recognition of video bloggers using audio-visual descriptors\",\"authors\":\"F. Haider, L. Cerrato, S. Luz, N. Campbell\",\"doi\":\"10.1145/3011263.3011270\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In social media, vlogs (video blogs) are a form of unidirectional communication, where the vloggers (video bloggers) convey their messages (opinions, thoughts, etc.) to a potential audience which cannot give them feedback in real time. In this kind of communication, the non-verbal behaviour and personality impression of a video blogger tends to influence viewers' attention because non-verbal cues are correlated with the messages conveyed by a vlogger. In this study, we use the acoustic and visual features (body movements that are captured by low-level visual descriptors) to predict the six different attitudes (amusement, enthusiasm, friendliness, frustration, impatience and neutral) annotated in the speech of 10 video bloggers. The automatic detection of attitude can be helpful in a scenario where a machine has to automatically provide feedback to bloggers about their performance in terms of the extent to which they manage to engage the audience by displaying certain attitudes. Attitude recognition models are trained using the random forest classifier. 
Results show that: 1) acoustic features provide better accuracy than the visual features, 2) while fusion of audio and visual features does not increase overall accuracy, it improves the results for some attitudes and subjects, and 3) densely extracted histograms of flow provide better results than other visual descriptors. A three-class (positive, negative and neutral attitudes) problem has also been defined. Results for this setting show that feature fusion degrades overall classifier accuracy, and the classifiers perform better on the original six-class problem than on the three-class setting.\",\"PeriodicalId\":272696,\"journal\":{\"name\":\"Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction\",\"volume\":\"14 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-11-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3011263.3011270\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3011263.3011270","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

In social media, vlogs (video blogs) are a form of unidirectional communication, where the vloggers (video bloggers) convey their messages (opinions, thoughts, etc.) to a potential audience which cannot give them feedback in real time. In this kind of communication, the non-verbal behaviour and personality impression of a video blogger tends to influence viewers' attention because non-verbal cues are correlated with the messages conveyed by a vlogger. In this study, we use the acoustic and visual features (body movements that are captured by low-level visual descriptors) to predict the six different attitudes (amusement, enthusiasm, friendliness, frustration, impatience and neutral) annotated in the speech of 10 video bloggers. The automatic detection of attitude can be helpful in a scenario where a machine has to automatically provide feedback to bloggers about their performance in terms of the extent to which they manage to engage the audience by displaying certain attitudes. Attitude recognition models are trained using the random forest classifier. Results show that: 1) acoustic features provide better accuracy than the visual features, 2) while fusion of audio and visual features does not increase overall accuracy, it improves the results for some attitudes and subjects, and 3) densely extracted histograms of flow provide better results than other visual descriptors. A three-class (positive, negative and neutral attitudes) problem has also been defined. Results for this setting show that feature fusion degrades overall classifier accuracy, and the classifiers perform better on the original six-class problem than on the three-class setting.
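The classification setup described in the abstract — a random forest trained on concatenated acoustic and visual feature vectors to predict one of six attitude labels — can be sketched with scikit-learn. This is an illustrative sketch, not the authors' actual pipeline: the feature dimensions, the synthetic data, and the evaluation protocol are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (the paper's real descriptor sizes differ):
# acoustic features (e.g. prosodic/spectral) and visual features
# (e.g. densely extracted histograms of flow).
n_samples, n_acoustic, n_visual = 300, 40, 60
X_audio = rng.normal(size=(n_samples, n_acoustic))
X_visual = rng.normal(size=(n_samples, n_visual))

# Six attitude classes: amusement, enthusiasm, friendliness,
# frustration, impatience, neutral (encoded 0..5).
y = rng.integers(0, 6, size=n_samples)

# Feature-level fusion: concatenate the two modalities per sample.
X_fused = np.hstack([X_audio, X_visual])

# Random forest classifier, as named in the abstract.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X_fused, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

The same skeleton covers the paper's comparisons: train on `X_audio` alone, `X_visual` alone, or `X_fused`, and remap `y` to three classes (positive, negative, neutral) for the three-class setting.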