GPT4Ego: Unleashing the Potential of Pre-Trained Models for Zero-Shot Egocentric Action Recognition

Guangzhao Dai; Xiangbo Shu; Wenhao Wu; Rui Yan; Jiachao Zhang
IEEE Transactions on Multimedia (IF 8.4, Q1 Computer Science, Information Systems; CAS Tier 1)
Published: 2024-12-27 | Vol. 27, pp. 401-413 | DOI: 10.1109/TMM.2024.3521658
Citations: 0

Abstract

Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for Zero-Shot Egocentric Action Recognition (ZS-EAR), which requires VLMs to recognize actions zero-shot from first-person videos rich in realistic human-environment interactions. Typically, VLMs treat ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of visual and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this work, we introduce a straightforward yet remarkably potent VLM framework, dubbed GPT4Ego, designed to enhance the fine-grained alignment of concepts and descriptions between vision and language. Specifically, we first propose a new Ego-oriented Text Prompting (EgoTP♠) scheme, which effectively prompts action-related textual-contextual semantics by evolving word-level class names into sentence-level contextual descriptions via ChatGPT with well-designed chain-of-thought textual prompts. Moreover, we design a new Ego-oriented Visual Parsing (EgoVP♣) strategy that learns action-related visual-contextual semantics by refining global-level images into part-level contextual concepts with the help of SAM. Extensive experiments demonstrate that GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4), EGTEA (39.6%, +5.5), and CharadesEgo (31.5%, +2.6). In addition, benefiting from the novel mechanism of fine-grained concept and description alignment, GPT4Ego can evolve sustainably alongside ever-improving pre-trained foundation models. We hope this work encourages the egocentric community to investigate pre-trained vision-language models further.
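The abstract describes the two modules only at a high level: EgoTP expands word-level class names into sentence-level contextual descriptions through chain-of-thought prompts to ChatGPT, and EgoVP uses SAM to refine whole frames into part-level concepts before matching them against those descriptions. The paper's actual prompts, backbones, and score fusion are not reproduced here, so the sketch below is only a minimal illustration of the fine-grained concept-description matching idea. It assumes OpenAI's clip and segment_anything packages, a CLIP ViT-B/32 backbone, hypothetical local file paths, and a pluggable llm callable (with simple fallback templates) standing in for the real ChatGPT call; none of these choices come from the paper itself.

```python
# Minimal sketch (not the authors' code) of the fine-grained matching idea in the
# abstract: EgoTP-style text expansion and EgoVP-style part-level visual parsing.
# Assumed backbones: OpenAI CLIP ViT-B/32 and Segment Anything (SAM); the paper's
# exact prompts, checkpoints, and score fusion are not specified here.
import numpy as np
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)


def egotp_descriptions(class_name, llm=None, n=3):
    """EgoTP-style prompting (illustrative wording, not the paper's exact prompts):
    evolve a word-level class name into sentence-level contextual descriptions."""
    prompt = (
        f"The egocentric action class is '{class_name}'. "
        "Step 1: describe the hands and objects involved. "
        "Step 2: describe the surrounding everyday context. "
        f"Step 3: write {n} one-sentence first-person descriptions of this action."
    )
    if llm is None:
        # Fallback templates so the sketch runs without an LLM; the paper uses ChatGPT here.
        return [
            f"a first-person view of hands performing the action '{class_name}'",
            f"an egocentric close-up of someone starting to {class_name}",
            f"a person about to {class_name} in a cluttered everyday scene",
        ][:n]
    reply = llm(prompt)          # llm: any callable mapping a prompt string to text
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()][:n]


def egovp_part_crops(frame, mask_generator, top_k=8):
    """EgoVP-style parsing: refine a global frame into part-level crops via SAM masks."""
    masks = mask_generator.generate(np.array(frame))
    masks = sorted(masks, key=lambda m: m["area"], reverse=True)[:top_k]
    crops = [frame]              # keep the global view alongside the parts
    for m in masks:
        x, y, w, h = (int(v) for v in m["bbox"])
        if w > 16 and h > 16:    # skip tiny fragments
            crops.append(frame.crop((x, y, x + w, y + h)))
    return crops


@torch.no_grad()
def zero_shot_scores(frames, class_names, mask_generator, llm=None):
    """Score each class by matching part-level concepts against sentence-level
    descriptions, rather than computing a single global video-text similarity."""
    scores = torch.zeros(len(class_names))
    for ci, name in enumerate(class_names):
        texts = egotp_descriptions(name, llm)
        text_feat = clip_model.encode_text(clip.tokenize(texts).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        per_frame = []
        for frame in frames:
            crops = egovp_part_crops(frame, mask_generator)
            images = torch.stack([clip_preprocess(c) for c in crops]).to(device)
            img_feat = clip_model.encode_image(images)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            per_frame.append((img_feat @ text_feat.T).max())  # best crop-description pair
        scores[ci] = torch.stack(per_frame).float().mean().item()
    return scores


if __name__ == "__main__":
    # Hypothetical local files: a SAM checkpoint and one sampled video frame.
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth").to(device)
    mask_gen = SamAutomaticMaskGenerator(sam)
    frames = [Image.open("frame_000.jpg").convert("RGB")]
    classes = ["open fridge", "cut onion", "wash pan"]
    print(zero_shot_scores(frames, classes, mask_gen))
```

The "best crop-description pair per frame, averaged over frames" fusion at the end is one simple choice made for this sketch; the paper may aggregate part-level and sentence-level similarities differently.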
Source journal
IEEE Transactions on Multimedia (Engineering & Technology - Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Average review time: 5.5 months
About the journal: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.