Label-Guided Dynamic Spatial-Temporal Fusion for Video-Based Facial Expression Recognition

IF 8.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | IEEE Transactions on Multimedia | Pub Date: 2024-06-10 | DOI: 10.1109/TMM.2024.3407693
Ziyang Zhang;Xiang Tian;Yuan Zhang;Kailing Guo;Xiangmin Xu
{"title":"Label-Guided Dynamic Spatial-Temporal Fusion for Video-Based Facial Expression Recognition","authors":"Ziyang Zhang;Xiang Tian;Yuan Zhang;Kailing Guo;Xiangmin Xu","doi":"10.1109/TMM.2024.3407693","DOIUrl":null,"url":null,"abstract":"Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach but may not always yield optimal results due to the distinct nature of spatial and temporal information. Extracting spatial and temporal features cascadingly has been proposed as an alternative approach However, the results of video-based FER sometimes fall short compared to image-based FER, indicating underutilization of spatial information of each frame and suboptimal modeling of frame relations in spatial-temporal fusion strategies. Although frame label is highly related to video label, it is overlooked in previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF) that adopts frame labels to enhance the discriminative ability of spatial features and guide temporal fusion. By assigning each frame a video label, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and label distribution of spatial features is utilized to measure the classification confidence of each frame. The confidence values serve as dynamic weights to emphasize crucial frames during temporal fusion of spatial features. Our LG-DSTF achieves state-of-the-art results on FER benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10503-10513"},"PeriodicalIF":8.4000,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10552397/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach but may not always yield optimal results due to the distinct nature of spatial and temporal information. Extracting spatial and temporal features in a cascaded manner has been proposed as an alternative approach. However, the results of video-based FER sometimes fall short compared to image-based FER, indicating underutilization of the spatial information of each frame and suboptimal modeling of frame relations in spatial-temporal fusion strategies. Although frame labels are highly related to video labels, they are overlooked in previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF), which adopts frame labels to enhance the discriminative ability of spatial features and to guide temporal fusion. By assigning each frame its video label, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and the label distribution of spatial features is used to measure the classification confidence of each frame. These confidence values serve as dynamic weights that emphasize crucial frames during the temporal fusion of spatial features. Our LG-DSTF achieves state-of-the-art results on FER benchmarks.
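The confidence-weighted fusion described in the abstract can be illustrated with a small sketch. The following is a minimal, hypothetical PyTorch example (not the authors' released code): it assumes that per-frame confidence is computed as the cross entropy between a uniform distribution and the frame's predicted label distribution from an auxiliary classifier, and that the confidences are normalized with a softmax over frames before weighting; all function and tensor names are illustrative.

```python
# Minimal sketch of label-guided dynamic temporal fusion (illustrative only).
# Assumptions: confidence = H(uniform, per-frame prediction), normalized over
# frames by softmax; the paper's exact normalization may differ.
import torch
import torch.nn.functional as F

def fuse_spatial_features(frame_features: torch.Tensor,
                          frame_logits: torch.Tensor) -> torch.Tensor:
    """frame_features: (T, D) spatial features of T frames.
    frame_logits: (T, C) class logits from an auxiliary frame-level classifier.
    Returns a single (D,) video-level feature."""
    num_classes = frame_logits.size(1)
    probs = F.softmax(frame_logits, dim=1)                        # (T, C)
    uniform = torch.full_like(probs, 1.0 / num_classes)           # (T, C)
    # Cross entropy H(uniform, probs); larger when the prediction is peaked,
    # i.e. when the frame is classified with higher confidence.
    confidence = -(uniform * torch.log(probs + 1e-8)).sum(dim=1)  # (T,)
    weights = F.softmax(confidence, dim=0)                        # (T,) dynamic weights
    return (weights.unsqueeze(1) * frame_features).sum(dim=0)     # (D,)
```

Under these assumptions, frames whose auxiliary predictions are more peaked (and thus presumably more expressive) receive larger weights, while ambiguous frames contribute less to the fused video-level representation.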
Source journal: IEEE Transactions on Multimedia (Engineering & Technology – Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Average review time: 5.5 months
Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.