High performance RGB-Thermal Video Object Detection via hybrid fusion with progressive interaction and temporal-modal difference

IF 14.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Information Fusion · Pub Date: 2024-09-12 · DOI: 10.1016/j.inffus.2024.102665
{"title":"High performance RGB-Thermal Video Object Detection via hybrid fusion with progressive interaction and temporal-modal difference","authors":"","doi":"10.1016/j.inffus.2024.102665","DOIUrl":null,"url":null,"abstract":"<div><p>RGB-Thermal Video Object Detection (RGBT VOD) is to localize and classify the predefined objects in visible and thermal spectrum videos. The key issue in RGBT VOD lies in integrating multi-modal information effectively to improve detection performance. Current multi-modal fusion methods predominantly employ middle fusion strategies, but the inherent modal difference directly influences the effect of multi-modal fusion. Although the early fusion strategy reduces the modality gap in the middle stage of the network, achieving in-depth feature interaction between different modalities remains challenging. In this work, we propose a novel hybrid fusion network called PTMNet, which effectively combines the early fusion strategy with the progressive interaction and the middle fusion strategy with the temporal-modal difference, for high performance RGBT VOD. In particular, we take each modality as a master modality to achieve an early fusion with other modalities as auxiliary information by progressive interaction. Such a design not only alleviates the modality gap but facilitates middle fusion. The temporal-modal difference models temporal information through spatial offsets and utilizes feature erasure between modalities to motivate the network to focus on shared objects in both modalities. The hybrid fusion can achieve high detection accuracy only using three input frames, which makes our PTMNet achieve a high inference speed. Experimental results show that our approach achieves state-of-the-art performance on the VT-VOD50 dataset and also operates at over 70 FPS. The code will be freely released at <span><span>https://github.com/tzz-ahu</span><svg><path></path></svg></span> for academic purposes.</p></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":null,"pages":null},"PeriodicalIF":14.7000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253524004433","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

RGB-Thermal Video Object Detection (RGBT VOD) aims to localize and classify predefined objects in visible and thermal spectrum videos. The key issue in RGBT VOD lies in integrating multi-modal information effectively to improve detection performance. Current multi-modal fusion methods predominantly employ middle fusion strategies, but the inherent modality difference directly limits the effectiveness of multi-modal fusion. Although the early fusion strategy reduces the modality gap in the middle stage of the network, achieving in-depth feature interaction between different modalities remains challenging. In this work, we propose a novel hybrid fusion network called PTMNet, which effectively combines an early fusion strategy based on progressive interaction with a middle fusion strategy based on the temporal-modal difference, for high-performance RGBT VOD. In particular, we take each modality in turn as the master modality and achieve early fusion, with the other modality serving as auxiliary information, through progressive interaction. Such a design not only alleviates the modality gap but also facilitates middle fusion. The temporal-modal difference models temporal information through spatial offsets and uses feature erasure between modalities to encourage the network to focus on objects shared by both modalities. The hybrid fusion achieves high detection accuracy using only three input frames, which allows PTMNet to reach a high inference speed. Experimental results show that our approach achieves state-of-the-art performance on the VT-VOD50 dataset while operating at over 70 FPS. The code will be freely released at https://github.com/tzz-ahu for academic purposes.
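
To make the two fusion ideas in the abstract concrete, the sketch below is a minimal, hypothetical PyTorch rendering of (i) progressive interaction, where each modality in turn acts as the master and gates in the other modality as auxiliary information, and (ii) the temporal-modal difference, which models temporal cues through spatial offsets and erases features not shared across modalities. All module names, tensor shapes, offset choices, and the gating/erasure rules here are illustrative assumptions rather than the authors' implementation; the released code at https://github.com/tzz-ahu is the authoritative reference.

```python
# Hypothetical sketch of the two fusion mechanisms described in the abstract.
# Shapes, gating, and erasure rules are assumptions for illustration only.
import torch
import torch.nn as nn


class ProgressiveInteraction(nn.Module):
    """Early fusion sketch: each modality serves as master and absorbs the other as auxiliary."""

    def __init__(self, channels: int, steps: int = 2):
        super().__init__()
        self.rgb_gates = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, kernel_size=1) for _ in range(steps)
        )
        self.thermal_gates = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, kernel_size=1) for _ in range(steps)
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        # Progressively mix a gated portion of the auxiliary modality into the master one.
        for rgb_gate, th_gate in zip(self.rgb_gates, self.thermal_gates):
            rgb_update = torch.sigmoid(rgb_gate(torch.cat([rgb, thermal], dim=1)))
            th_update = torch.sigmoid(th_gate(torch.cat([thermal, rgb], dim=1)))
            rgb = rgb + rgb_update * thermal
            thermal = thermal + th_update * rgb
        return rgb, thermal


class TemporalModalDifference(nn.Module):
    """Middle fusion sketch: temporal cues via spatial offsets plus cross-modal feature erasure."""

    def __init__(self, channels: int, max_shift: int = 1):
        super().__init__()
        self.max_shift = max_shift
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, frames_rgb: torch.Tensor, frames_thermal: torch.Tensor):
        # frames_*: (B, T, C, H, W) with a small T (e.g. three input frames).
        center_rgb = frames_rgb[:, -1]
        center_th = frames_thermal[:, -1]

        # Temporal information through spatial offsets (illustrated on the RGB
        # stream only): shift an earlier frame and compare it with the current one.
        shifted = torch.roll(
            frames_rgb[:, 0], shifts=(self.max_shift, self.max_shift), dims=(-2, -1)
        )
        temporal_diff = center_rgb - shifted

        # Feature erasure between modalities: suppress responses that only one
        # modality activates so the network focuses on objects shared by both.
        shared_mask = center_rgb.sigmoid() * center_th.sigmoid()
        erased_rgb = center_rgb * shared_mask
        erased_th = center_th * shared_mask

        return self.fuse(torch.cat([erased_rgb + temporal_diff, erased_th], dim=1))


if __name__ == "__main__":
    b, t, c, h, w = 2, 3, 32, 64, 64
    pi = ProgressiveInteraction(c)
    tmd = TemporalModalDifference(c)
    rgb, th = pi(torch.randn(b, c, h, w), torch.randn(b, c, h, w))
    fused = tmd(torch.randn(b, t, c, h, w), torch.randn(b, t, c, h, w))
    print(fused.shape)  # torch.Size([2, 32, 64, 64])
```

In this reading, progressive interaction keeps two modality-specific streams while repeatedly exchanging gated information between them (reducing the modality gap before middle fusion), and the temporal-modal difference operates on a short clip so detection can stay fast with only three input frames; the actual PTMNet design may differ in both respects.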

Source journal
Information Fusion (Engineering & Technology – Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles published: 161
Review time: 7.9 months
About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.
Latest articles in this journal
Review of multimodal machine learning approaches in healthcare
Multimodal fusion for large-scale traffic prediction with heterogeneous retentive networks
Scalable data fusion via a scale-based hierarchical framework: Adapting to multi-source and multi-scale scenarios
High performance RGB-Thermal Video Object Detection via hybrid fusion with progressive interaction and temporal-modal difference
Tensor-based unsupervised feature selection for error-robust handling of unbalanced incomplete multi-view data