Dong Li, Jiandong Jin, Yuhao Zhang, Yanlin Zhong, Yaoyang Wu, Lan Chen, Xiao Wang, Bin Luo
{"title":"通过大型视觉语言模型进行基于语义感知的帧-事件融合模式识别","authors":"Dong Li , Jiandong Jin , Yuhao Zhang , Yanlin Zhong , Yaoyang Wu , Lan Chen , Xiao Wang , Bin Luo","doi":"10.1016/j.patcog.2024.111080","DOIUrl":null,"url":null,"abstract":"<div><div>Pattern recognition through the fusion of RGB frames and Event streams has emerged as a novel research area in recent years. Current methods typically employ backbone networks to individually extract the features of RGB frames and event streams, and subsequently fuse these features for pattern recognition. However, we posit that these methods may suffer from two key issues: (1). They attempt to directly learn a mapping from the input vision modality to the semantic labels. This approach often leads to sub-optimal results due to the disparity between the input and semantic labels; (2). They utilize small-scale backbone networks for the extraction of RGB and Event input features, thus these models fail to harness the recent performance advancements of large-scale visual-language models. In this study, we introduce a novel pattern recognition framework that consolidates the semantic labels, RGB frames, and event streams, leveraging pre-trained large-scale vision–language models. Specifically, given the input RGB frames, event streams, and all the predefined semantic labels, we employ a pre-trained large-scale vision model (CLIP vision encoder) to extract the RGB and event features. To handle the semantic labels, we initially convert them into language descriptions through prompt engineering and polish using ChatGPT, and then obtain the semantic features using the pre-trained large-scale language model (CLIP text encoder). Subsequently, we integrate the RGB/Event features and semantic features using multimodal Transformer networks. The resulting frame and event tokens are further amplified using self-attention layers. Concurrently, we propose to enhance the interactions between text tokens and RGB/Event tokens via cross-attention. Finally, we consolidate all three modalities using self-attention and feed-forward layers for recognition. Comprehensive experiments on the HARDVS and PokerEvent datasets fully substantiate the efficacy of our proposed SAFE model. The source code has been released at <span><span>https://github.com/Event-AHU/SAFE_LargeVLM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111080"},"PeriodicalIF":7.5000,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Semantic-aware frame-event fusion based pattern recognition via large vision–language models\",\"authors\":\"Dong Li , Jiandong Jin , Yuhao Zhang , Yanlin Zhong , Yaoyang Wu , Lan Chen , Xiao Wang , Bin Luo\",\"doi\":\"10.1016/j.patcog.2024.111080\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Pattern recognition through the fusion of RGB frames and Event streams has emerged as a novel research area in recent years. Current methods typically employ backbone networks to individually extract the features of RGB frames and event streams, and subsequently fuse these features for pattern recognition. However, we posit that these methods may suffer from two key issues: (1). They attempt to directly learn a mapping from the input vision modality to the semantic labels. 
This approach often leads to sub-optimal results due to the disparity between the input and semantic labels; (2). They utilize small-scale backbone networks for the extraction of RGB and Event input features, thus these models fail to harness the recent performance advancements of large-scale visual-language models. In this study, we introduce a novel pattern recognition framework that consolidates the semantic labels, RGB frames, and event streams, leveraging pre-trained large-scale vision–language models. Specifically, given the input RGB frames, event streams, and all the predefined semantic labels, we employ a pre-trained large-scale vision model (CLIP vision encoder) to extract the RGB and event features. To handle the semantic labels, we initially convert them into language descriptions through prompt engineering and polish using ChatGPT, and then obtain the semantic features using the pre-trained large-scale language model (CLIP text encoder). Subsequently, we integrate the RGB/Event features and semantic features using multimodal Transformer networks. The resulting frame and event tokens are further amplified using self-attention layers. Concurrently, we propose to enhance the interactions between text tokens and RGB/Event tokens via cross-attention. Finally, we consolidate all three modalities using self-attention and feed-forward layers for recognition. Comprehensive experiments on the HARDVS and PokerEvent datasets fully substantiate the efficacy of our proposed SAFE model. The source code has been released at <span><span>https://github.com/Event-AHU/SAFE_LargeVLM</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"158 \",\"pages\":\"Article 111080\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2024-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320324008318\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324008318","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Semantic-aware frame-event fusion based pattern recognition via large vision–language models
Pattern recognition through the fusion of RGB frames and Event streams has emerged as a novel research area in recent years. Current methods typically employ backbone networks to individually extract the features of RGB frames and event streams, and subsequently fuse these features for pattern recognition. However, we posit that these methods may suffer from two key issues: (1) they attempt to directly learn a mapping from the input vision modality to the semantic labels, an approach that often leads to sub-optimal results due to the disparity between the input and the semantic labels; (2) they utilize small-scale backbone networks to extract the RGB and Event input features, so these models fail to harness the recent performance advancements of large-scale vision–language models. In this study, we introduce a novel pattern recognition framework that consolidates the semantic labels, RGB frames, and event streams, leveraging pre-trained large-scale vision–language models. Specifically, given the input RGB frames, event streams, and all the predefined semantic labels, we employ a pre-trained large-scale vision model (the CLIP vision encoder) to extract the RGB and event features. To handle the semantic labels, we first convert them into language descriptions through prompt engineering, polish them using ChatGPT, and then obtain the semantic features with the pre-trained large-scale language model (the CLIP text encoder). Subsequently, we integrate the RGB/Event features and semantic features using multimodal Transformer networks. The resulting frame and event tokens are further refined using self-attention layers. Concurrently, we enhance the interactions between text tokens and RGB/Event tokens via cross-attention. Finally, we consolidate all three modalities using self-attention and feed-forward layers for recognition. Comprehensive experiments on the HARDVS and PokerEvent datasets fully substantiate the efficacy of our proposed SAFE model. The source code has been released at https://github.com/Event-AHU/SAFE_LargeVLM.
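The fusion pipeline described in the abstract (self-attention over frame and event tokens, cross-attention between text tokens and visual tokens, then joint self-attention and feed-forward layers for recognition) can be sketched as follows. This is a hedged reading of the description, not the authors' released SAFE code: it assumes features have already been extracted by the CLIP vision and text encoders, and the module names, layer counts, pooling, and classification head are illustrative assumptions.

```python
# Minimal PyTorch sketch of a SAFE-style semantic-aware fusion head.
# Inputs are assumed to be pre-extracted CLIP features; all hyper-parameters
# (feature dim, head count, class count) are illustrative, not the paper's.
import torch
import torch.nn as nn


class SemanticAwareFusion(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, num_classes=300):
        super().__init__()
        # Self-attention blocks that refine the frame and event tokens.
        self.frame_sa = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.event_sa = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        # Cross-attention: semantic-label (text) tokens query the visual tokens.
        self.cross_attn = nn.MultiheadAttention(
            feat_dim, num_heads, batch_first=True)
        # Joint self-attention + feed-forward over all three modalities.
        self.fusion = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        # Simple pooled classification head (an assumption for this sketch).
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_tokens, event_tokens, text_tokens):
        # frame_tokens: (B, Nf, D) CLIP-vision features of RGB frames
        # event_tokens: (B, Ne, D) CLIP-vision features of event representations
        # text_tokens:  (B, Nt, D) CLIP-text features of label descriptions
        frame_tokens = self.frame_sa(frame_tokens)
        event_tokens = self.event_sa(event_tokens)
        visual = torch.cat([frame_tokens, event_tokens], dim=1)
        # Text tokens attend to the fused visual tokens (cross-attention).
        text_tokens, _ = self.cross_attn(text_tokens, visual, visual)
        fused = self.fusion(torch.cat([visual, text_tokens], dim=1))
        return self.classifier(fused.mean(dim=1))  # (B, num_classes) logits


if __name__ == "__main__":
    model = SemanticAwareFusion()
    logits = model(torch.randn(2, 8, 512),     # 8 frame tokens
                   torch.randn(2, 8, 512),     # 8 event tokens
                   torch.randn(2, 300, 512))   # one text token per label
    print(logits.shape)  # torch.Size([2, 300])
```

In the paper, the text tokens come from ChatGPT-polished label descriptions encoded by the CLIP text encoder; here they are assumed to be already encoded, and the mean-pool-plus-linear head stands in for whatever recognition head the released implementation uses.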
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.