Fusion of learned multi-modal representations and dense trajectories for emotional analysis in videos

Esra Acar, F. Hopfgartner, S. Albayrak
{"title":"Fusion of learned multi-modal representations and dense trajectories for emotional analysis in videos","authors":"Esra Acar, F. Hopfgartner, S. Albayrak","doi":"10.1109/CBMI.2015.7153603","DOIUrl":null,"url":null,"abstract":"When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modality of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently from the chosen representation.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CBMI.2015.7153603","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher-level representations based on these low-level features. In this work, we propose to use deep learning methods, in particular convolutional neural networks (CNNs), to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modalities of videos by employing Mel-Frequency Cepstral Coefficients (MFCCs) and color values in the HSV color space. We also incorporate dense-trajectory-based motion features to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher-level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
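The abstract outlines the pipeline but not its implementation details. Below is a minimal late-fusion sketch in Python with scikit-learn, assuming one multi-class SVM per modality whose class-probability estimates are averaged; the feature matrices, RBF kernel, and hyper-parameters are placeholders for illustration, not the authors' settings.

```python
# Illustrative late-fusion sketch (not the authors' exact pipeline):
# one multi-class SVM per modality, class probabilities averaged.
# The real mid-level CNN features (MFCC-based audio, HSV-based visual)
# and dense-trajectory descriptors are replaced by random placeholders.
import numpy as np
from sklearn.svm import SVC

QUADRANTS = ["high-valence/high-arousal", "low-valence/high-arousal",
             "low-valence/low-arousal", "high-valence/low-arousal"]

rng = np.random.default_rng(0)
n_train, n_test = 200, 8

# Hypothetical per-segment feature matrices, one per modality.
train = {
    "audio":  rng.normal(size=(n_train, 64)),   # stand-in for CNN-on-MFCC features
    "visual": rng.normal(size=(n_train, 64)),   # stand-in for CNN-on-HSV features
    "motion": rng.normal(size=(n_train, 128)),  # stand-in for dense-trajectory features
}
test = {
    "audio":  rng.normal(size=(n_test, 64)),
    "visual": rng.normal(size=(n_test, 64)),
    "motion": rng.normal(size=(n_test, 128)),
}
y = rng.integers(0, 4, size=n_train)  # VA-quadrant labels 0..3

# Train one multi-class SVM per modality (RBF kernel is an assumption;
# the abstract does not state kernels or hyper-parameters).
models = {m: SVC(kernel="rbf", probability=True).fit(X, y)
          for m, X in train.items()}

# Late fusion: average the per-modality class-probability estimates
# and pick the most likely Valence-Arousal quadrant per segment.
fused = sum(models[m].predict_proba(test[m]) for m in models) / len(models)
for i, k in enumerate(fused.argmax(axis=1)):
    print(f"segment {i}: {QUADRANTS[k]}")
```

Since the abstract speaks of "fusion mechanisms" in the plural, feature-level (early) fusion, i.e., concatenating the per-modality vectors before a single SVM, is an equally plausible variant of this sketch.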