Cross-Modality Distillation for Multi-Modal Tracking

Tianlu Zhang;Qiang Zhang;Kurt Debattista;Jungong Han
DOI: 10.1109/TPAMI.2025.3555485
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 7, pp. 5847-5865
Publication date: 2025-03-28
URL: https://ieeexplore.ieee.org/document/10943265/
Citations: 0

Abstract

Contemporary multi-modal trackers achieve strong performance by leveraging complex backbones and fusion strategies, but this comes at the cost of computational efficiency, limiting their deployment in resource-constrained settings. On the other hand, compact multi-modal trackers are more efficient but often suffer from reduced performance due to limited feature representation. To mitigate the performance gap between compact and more complex trackers, we introduce a cross-modality distillation framework. This framework includes a complementarity-aware mask autoencoder designed to enhance cross-modal interactions by selectively masking patches within a modality, thereby forcing the model to learn more robust multi-modal representations. Additionally, we present a specific-common feature distillation module that transfers both modality-specific and shared information from a more powerful model’s backbone to the compact model. Moreover, we develop a multi-path selection distillation module to guide a simple fusion module in learning more accurate multi-modal information from a sophisticated fusion mechanism using multiple paths. Extensive experiments on six multi-modal tracking benchmarks demonstrate that the proposed tracker, despite being lightweight, outperforms most state-of-the-art methods, highlighting its effectiveness. Notably, our tiny variant achieves a PR score of 67.5% on LasHeR, a PR score of 58.5% on DepthTrack, and a PR score of 73.1% on VisEvent with only 6.5 M parameters, while operating at 126 FPS on an NVIDIA 2080Ti GPU.
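The complementarity-aware masked autoencoder described above masks patches within one modality so that the model must recover the missing content from the other modality. The abstract does not specify the selection rule, so the sketch below is an assumption: patches with the highest (hypothetical) complementarity scores are masked first, falling back to random masking when no scores are available. Function and parameter names are illustrative, not the paper's API.

```python
import random

def complementarity_mask(num_patches, mask_ratio, scores=None, seed=0):
    """Pick patch indices to mask in ONE modality (hedged sketch).

    scores: optional per-patch complementarity scores (higher = mask
    first); this scoring rule is an assumption, not the paper's method.
    Returns a sorted list of masked patch indices.
    """
    k = int(num_patches * mask_ratio)
    if scores is None:
        # No scores: plain random masking, as in a standard MAE.
        rng = random.Random(seed)
        idx = rng.sample(range(num_patches), k)
    else:
        # Mask the k patches with the highest complementarity scores,
        # forcing reconstruction to rely on the other modality.
        idx = sorted(range(num_patches), key=lambda i: scores[i],
                     reverse=True)[:k]
    return sorted(idx)
```

For example, with 16 patches, a 0.5 mask ratio, and scores that increase with patch index, the upper half of the patches would be masked.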
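The specific-common feature distillation module transfers both modality-specific and shared information from the teacher backbone to the compact student. A minimal sketch of such an objective is shown below, under stated assumptions: features are flat vectors, the "common" component is approximated by the element-wise mean of the two modality features, and the losses are plain MSE terms. The paper's actual decomposition and loss weighting may differ.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def specific_common_distill_loss(student_rgb, student_aux,
                                 teacher_rgb, teacher_aux,
                                 w_specific=1.0, w_common=1.0):
    """Hedged sketch of a specific-common distillation objective.

    Modality-specific terms match each student modality feature to the
    corresponding teacher feature; the common term matches a shared
    component, here approximated by the per-element mean (assumption).
    """
    # Modality-specific alignment, one term per modality.
    l_specific = mse(student_rgb, teacher_rgb) + mse(student_aux, teacher_aux)
    # Shared (common) component alignment.
    common_s = [(r + a) / 2 for r, a in zip(student_rgb, student_aux)]
    common_t = [(r + a) / 2 for r, a in zip(teacher_rgb, teacher_aux)]
    l_common = mse(common_s, common_t)
    return w_specific * l_specific + w_common * l_common
```

When the student's features already match the teacher's exactly, both terms vanish and the loss is zero.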
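The multi-path selection distillation module guides the student's simple fusion toward a sophisticated teacher fusion mechanism that produces multiple candidate paths. The abstract leaves the selection criterion unspecified, so the sketch below assumes a simple rule: among the teacher's path outputs, pick the one closest to a reference target, then distil the student's fused output toward it. All names here are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def multipath_selection_loss(student_fused, teacher_path_outs, target):
    """Hedged sketch of multi-path selection distillation.

    teacher_path_outs: outputs from several teacher fusion paths.
    Selection rule (an assumption): choose the path output closest to
    `target`, then match the student's simple fusion output to it.
    """
    best_path = min(teacher_path_outs, key=lambda p: mse(p, target))
    return mse(student_fused, best_path)
```

If the student's fused output already coincides with the best-matching teacher path, the distillation loss is zero.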