RTF: Recursive TransFusion for Multi-Modal Image Synthesis

IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society) | IF: 13.7 | Pub Date: 2025-02-24 | DOI: 10.1109/TIP.2025.3541877

Bing Cao;Guoliang Qi;Jiaming Zhao;Pengfei Zhu;Qinghua Hu;Xinbo Gao

Volume 34, pages 1573-1587. Publication type: Journal Article. Full record: https://ieeexplore.ieee.org/document/10901869/


Abstract

Multi-modal image synthesis is crucial for obtaining complete modalities because of real-world imaging restrictions. Current methods, primarily CNN-based models, struggle to extract global representations because of their local inductive bias, which leads to structural deformation or color distortion in the synthesized images. Transformers capture long-range dependencies and therefore offer strong global representation ability, but their large parameter count requires considerable training data. Multi-modal synthesis based solely on one of these two architectures thus has difficulty extracting comprehensive information from each modality when data are limited. To address this dilemma, we propose a simple yet effective Recursive TransFusion (RTF) framework for multi-modal image synthesis. Specifically, we develop a TransFusion unit that integrates local knowledge extracted from each individual modality by connecting a CNN-based local representation block (LRB) and a transformer-based global fusion block (GFB) via a feature translating gate (FTG). Because the transformer introduces numerous parameters, we further unfold the TransFusion unit repeatedly under a recursive constraint, forming the Recursive TransFusion (RTF), which progressively extracts multi-modal information at different depths. RTF substantially reduces network parameters while maintaining superior performance. Extensive experiments on multiple benchmarks validate its superiority over competing methods. The source code will be available at https://github.com/guoliangq/RTF.
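The data flow described in the abstract (local block, gate, global fusion, repeated unfolding) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the internals of the LRB, FTG, or GFB, so everything below is an assumption made for illustration. In particular, the sketch assumes that the "recursive constraint" means one set of weights is reused at every unfolding step, that the gate is a scalar sigmoid gate, that fusion is softmax attention over scalar tokens, and that a residual connection is used. All function names are hypothetical.

```python
import math
import random


def make_unit_params(seed=0):
    """Toy parameters for ONE unit (hypothetical; not from the paper)."""
    rng = random.Random(seed)
    return {
        "local_kernel": [rng.uniform(-0.5, 0.5) for _ in range(3)],  # LRB stand-in
        "gate_weight": rng.uniform(-1.0, 1.0),                       # FTG stand-in
        "gate_bias": 0.0,
    }


def local_block(x, kernel):
    """LRB stand-in: 1-D convolution with zero padding (local receptive field)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]


def gate(x, weight, bias):
    """FTG stand-in: scale each feature by sigmoid(weight * feature + bias)."""
    return [v / (1.0 + math.exp(-(weight * v + bias))) for v in x]


def global_fusion(tokens):
    """GFB stand-in: softmax attention in which every token attends to all tokens."""
    fused = []
    for q in tokens:
        scores = [q * k for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        fused.append(sum(w * v for w, v in zip(weights, tokens)) / total)
    return fused


def transfusion_unit(modalities, params):
    """One unit: per-modality local block and gate, then joint global fusion."""
    gated = [gate(local_block(m, params["local_kernel"]),
                  params["gate_weight"], params["gate_bias"])
             for m in modalities]
    tokens = [v for m in gated for v in m]  # concatenate all modalities
    fused = global_fusion(tokens)
    out, start = [], 0
    for m in modalities:
        chunk = fused[start:start + len(m)]
        out.append([a + b for a, b in zip(m, chunk)])  # assumed residual connection
        start += len(m)
    return out


def recursive_transfusion(modalities, params, steps):
    """Unfold the SAME unit `steps` times; no new parameters are created per step."""
    for _ in range(steps):
        modalities = transfusion_unit(modalities, params)
    return modalities


def count_params(params):
    """Number of scalar parameters in the toy unit."""
    return sum(len(v) if isinstance(v, list) else 1 for v in params.values())


if __name__ == "__main__":
    params = make_unit_params(seed=0)
    feats = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]  # two toy modalities
    steps = 3
    out = recursive_transfusion(feats, params, steps)
    print("output lengths:", [len(m) for m in out])
    print("shared-weight parameters:", count_params(params))
    print("parameters if each step had its own unit:", steps * count_params(params))
```

Under these assumptions, the parameter count stays fixed as the number of unfolding steps grows, whereas stacking independent units would scale it linearly with depth. That is one plausible reading of how recursion could reduce network parameters; the actual mechanism should be checked against the paper and the released code.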



