CGCN: Context Graph Convolutional Network for Few-Shot Temporal Action Localization

IF 7.4 | Tier 1 (Management Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Information Processing & Management | Pub Date: 2024-10-15 | DOI: 10.1016/j.ipm.2024.103926
Shihui Zhang , Houlin Wang , Lei Wang , Xueqiang Han , Qing Tian
{"title":"CGCN:用于少量时间动作定位的上下文图卷积网络","authors":"Shihui Zhang ,&nbsp;Houlin Wang ,&nbsp;Lei Wang ,&nbsp;Xueqiang Han ,&nbsp;Qing Tian","doi":"10.1016/j.ipm.2024.103926","DOIUrl":null,"url":null,"abstract":"<div><div>Localizing human actions in videos has attracted extensive attention from industry and academia. Few-Shot Temporal Action Localization (FS-TAL) aims to detect human actions in untrimmed videos using a limited number of training samples. Existing FS-TAL methods usually ignore the semantic context between video snippets, making it difficult to detect actions during the query process. In this paper, we propose a novel FS-TAL method named Context Graph Convolutional Network (CGCN) which employs multi-scale graph convolution to aggregate semantic context between video snippets in addition to exploiting their temporal context. Specifically, CGCN constructs a graph for each scale of a video, where each video snippet is a node, and the relationships between the snippets are edges. There are three types of edges, namely sequence edges, intra-action edges, and inter-action edges. CGCN establishes sequence edges to enhance temporal expression. Intra-action edges utilize hyperbolic space to encapsulate context among video snippets within each action, while inter-action edges leverage Euclidean space to capture similar semantics between different actions. Through graph convolution on each scale, CGCN enables the acquisition of richer and context-aware video representations. Experiments demonstrate CGCN outperforms the second-best method by 4.5%/0.9% and 4.3%/0.9% mAP on the ActivityNet and THUMOS14 datasets in one-shot/five-shot scenarios, respectively, at [email protected]. The source code can be found in <span><span>https://github.com/mugenggeng/CGCN.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":null,"pages":null},"PeriodicalIF":7.4000,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CGCN: Context graph convolutional network for few-shot temporal action localization\",\"authors\":\"Shihui Zhang ,&nbsp;Houlin Wang ,&nbsp;Lei Wang ,&nbsp;Xueqiang Han ,&nbsp;Qing Tian\",\"doi\":\"10.1016/j.ipm.2024.103926\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Localizing human actions in videos has attracted extensive attention from industry and academia. Few-Shot Temporal Action Localization (FS-TAL) aims to detect human actions in untrimmed videos using a limited number of training samples. Existing FS-TAL methods usually ignore the semantic context between video snippets, making it difficult to detect actions during the query process. In this paper, we propose a novel FS-TAL method named Context Graph Convolutional Network (CGCN) which employs multi-scale graph convolution to aggregate semantic context between video snippets in addition to exploiting their temporal context. Specifically, CGCN constructs a graph for each scale of a video, where each video snippet is a node, and the relationships between the snippets are edges. There are three types of edges, namely sequence edges, intra-action edges, and inter-action edges. CGCN establishes sequence edges to enhance temporal expression. 
Intra-action edges utilize hyperbolic space to encapsulate context among video snippets within each action, while inter-action edges leverage Euclidean space to capture similar semantics between different actions. Through graph convolution on each scale, CGCN enables the acquisition of richer and context-aware video representations. Experiments demonstrate CGCN outperforms the second-best method by 4.5%/0.9% and 4.3%/0.9% mAP on the ActivityNet and THUMOS14 datasets in one-shot/five-shot scenarios, respectively, at [email protected]. The source code can be found in <span><span>https://github.com/mugenggeng/CGCN.git</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2024-10-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457324002851\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457324002851","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Localizing human actions in videos has attracted extensive attention from industry and academia. Few-Shot Temporal Action Localization (FS-TAL) aims to detect human actions in untrimmed videos using a limited number of training samples. Existing FS-TAL methods usually ignore the semantic context between video snippets, making it difficult to detect actions during the query process. In this paper, we propose a novel FS-TAL method named Context Graph Convolutional Network (CGCN) which employs multi-scale graph convolution to aggregate semantic context between video snippets in addition to exploiting their temporal context. Specifically, CGCN constructs a graph for each scale of a video, where each video snippet is a node, and the relationships between the snippets are edges. There are three types of edges, namely sequence edges, intra-action edges, and inter-action edges. CGCN establishes sequence edges to enhance temporal expression. Intra-action edges utilize hyperbolic space to encapsulate context among video snippets within each action, while inter-action edges leverage Euclidean space to capture similar semantics between different actions. Through graph convolution on each scale, CGCN enables the acquisition of richer and context-aware video representations. Experiments demonstrate CGCN outperforms the second-best method by 4.5%/0.9% and 4.3%/0.9% mAP on the ActivityNet and THUMOS14 datasets in one-shot/five-shot scenarios, respectively, at [email protected]. The source code can be found in https://github.com/mugenggeng/CGCN.git.
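To make the three edge types described in the abstract concrete, the sketch below shows one plausible way such a context graph could be assembled at a single temporal scale and passed through one graph-convolution step. It is an illustrative reconstruction based only on the abstract, not the authors' released code: the snippet features, candidate action segments, threshold values, and helper names (project_to_ball, build_context_graph, graph_conv) are all hypothetical.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Assumes per-snippet features and hypothetical candidate action segments, and
# shows one way the three edge types could be realised: sequence edges between
# neighbouring snippets, intra-action edges scored in hyperbolic (Poincare-ball)
# space, and inter-action edges scored in Euclidean space.
import numpy as np

def poincare_distance(u, v, eps=1e-6):
    """Geodesic distance between two points inside the Poincare ball."""
    uu = np.clip(np.sum(u * u), 0.0, 1.0 - eps)
    vv = np.clip(np.sum(v * v), 0.0, 1.0 - eps)
    duv = np.sum((u - v) ** 2)
    return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)) + eps)

def project_to_ball(x, max_norm=0.9):
    """Rescale features so every row lies strictly inside the unit ball."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-6) * max_norm * np.tanh(norms)

def build_context_graph(feats, segments, intra_thresh=1.5, inter_thresh=0.8):
    """
    feats:    (N, D) snippet features at one temporal scale.
    segments: list of (start, end) snippet-index ranges, one per candidate action.
    Returns a symmetric adjacency matrix combining the three edge types.
    """
    n = feats.shape[0]
    adj = np.eye(n)

    # 1) Sequence edges: link temporally adjacent snippets.
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0

    # 2) Intra-action edges: within each candidate action, link snippets that
    #    are close in hyperbolic space.
    ball = project_to_ball(feats)
    for s, e in segments:
        for i in range(s, e):
            for j in range(i + 1, e):
                if poincare_distance(ball[i], ball[j]) < intra_thresh:
                    adj[i, j] = adj[j, i] = 1.0

    # 3) Inter-action edges: across different candidate actions, link snippets
    #    with high Euclidean (cosine) similarity.
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-6)
    sim = normed @ normed.T
    for a, (s1, e1) in enumerate(segments):
        for b, (s2, e2) in enumerate(segments):
            if a >= b:
                continue
            for i in range(s1, e1):
                for j in range(s2, e2):
                    if sim[i, j] > inter_thresh:
                        adj[i, j] = adj[j, i] = 1.0
    return adj

def graph_conv(feats, adj, weight):
    """One symmetric-normalised graph-convolution step: ReLU(D^-1/2 A D^-1/2 X W)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return np.maximum(d_inv_sqrt @ adj @ d_inv_sqrt @ feats @ weight, 0.0)

# Toy usage: 8 snippets, 16-dim features, two hypothetical action candidates.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
adj = build_context_graph(x, segments=[(0, 4), (4, 8)])
out = graph_conv(x, adj, rng.normal(size=(16, 16)))
print(out.shape)  # (8, 16)
```

In this reading, hyperbolic distance scores only within-action snippet pairs, cosine similarity scores cross-action pairs, and plain temporal adjacency supplies the sequence edges; the actual CGCN design (see the repository linked above) may define scales, candidate segments, and thresholds differently.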
Source journal
Information Processing & Management (Engineering & Technology - Computer Science: Information Systems)
CiteScore: 17.00
Self-citation rate: 11.60%
Articles published: 276
Review time: 39 days
Journal description: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.
Latest articles from this journal
ME3A: A Multimodal Entity Entailment framework for multimodal Entity Alignment
Hierarchical multi-label text classification of tourism resources using a label-aware dual graph attention network
Impact of economic and socio-political risk factors on sovereign credit ratings
Higher-order structure based node importance evaluation in directed networks
Membership inference attacks via spatial projection-based relative information loss in MLaaS