RTF: Recursive TransFusion for Multi-Modal Image Synthesis

Bing Cao;Guoliang Qi;Jiaming Zhao;Pengfei Zhu;Qinghua Hu;Xinbo Gao
{"title":"RTF: Recursive TransFusion for Multi-Modal Image Synthesis","authors":"Bing Cao;Guoliang Qi;Jiaming Zhao;Pengfei Zhu;Qinghua Hu;Xinbo Gao","doi":"10.1109/TIP.2025.3541877","DOIUrl":null,"url":null,"abstract":"Multi-modal image synthesis is crucial for obtaining complete modalities due to the imaging restrictions in reality. Current methods, primarily CNN-based models, find it challenging to extract global representations because of local inductive bias, leading to synthetic structure deformation or color distortion. Despite the significant global representation ability of transformer in capturing long-range dependencies, its huge parameter size requires considerable training data. Multi-modal synthesis solely based on one of the two structures makes it hard to extract comprehensive information from each modality with limited data. To tackle this dilemma, we propose a simple yet effective Recursive TransFusion (RTF) framework for multi-modal image synthesis. Specifically, we develop a TransFusion unit to integrate local knowledge extracted from the individual modality by connecting a CNN-based local representation block (LRB) and a transformer-based global fusion block (GFB) via a feature translating gate (FTG). Considering the numerous parameters introduced by the transformer, we further unfold a TransFusion unit with recursive constraint repeatedly, forming recursive TransFusion (RTF), which progressively extracts multi-modal information at different depths. Our RTF remarkably reduces network parameters while maintaining superior performance. Extensive experiments validate our superiority against the competing methods on multiple benchmarks. 
The source code will be available at <uri>https://github.com/guoliangq/RTF</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1573-1587"},"PeriodicalIF":13.7000,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10901869/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multi-modal image synthesis is crucial for obtaining complete modalities, given the imaging restrictions encountered in practice. Current methods, primarily CNN-based models, struggle to extract global representations because of their local inductive bias, leading to structural deformation or color distortion in the synthesized images. Although the transformer offers strong global representation ability by capturing long-range dependencies, its huge parameter count requires considerable training data. Multi-modal synthesis built solely on either of the two structures therefore struggles to extract comprehensive information from each modality with limited data. To tackle this dilemma, we propose a simple yet effective Recursive TransFusion (RTF) framework for multi-modal image synthesis. Specifically, we develop a TransFusion unit that integrates the local knowledge extracted from each individual modality by connecting a CNN-based local representation block (LRB) and a transformer-based global fusion block (GFB) via a feature translating gate (FTG). Considering the numerous parameters introduced by the transformer, we further unfold the TransFusion unit repeatedly under a recursive constraint, forming the Recursive TransFusion (RTF), which progressively extracts multi-modal information at different depths. Our RTF remarkably reduces network parameters while maintaining superior performance. Extensive experiments validate our superiority over competing methods on multiple benchmarks. The source code will be available at https://github.com/guoliangq/RTF.