High-Fidelity and Efficient Pluralistic Image Completion with Transformers

Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao
IEEE Transactions on Pattern Analysis and Machine Intelligence
DOI: 10.1109/TPAMI.2024.3424835
Published: 2024-07-09 (Journal Article)

Abstract

Image completion has made tremendous progress with convolutional neural networks (CNNs) because of their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive priors, spatially invariant kernels), CNNs do not perform well in understanding global structures, nor do they naturally support pluralistic completion. Recently, transformers have demonstrated their power in modeling long-range relationships and generating diverse results, but their computational complexity is quadratic in the input length, which hampers their application to high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with a transformer and texture replenishment with a CNN. The transformer recovers pluralistic coherent structures together with some coarse textures, while the CNN enhances the local texture details of these coarse priors, guided by the high-resolution masked images. To decode diversified outputs from transformers, auto-regressive sampling is the most common method, but it is extremely inefficient. We overcome this issue by proposing a new decoding strategy, temperature annealing probabilistic sampling (TAPS), which achieves up to a 70× inference speedup while maintaining the high quality and diversity of the sampled global structures. Moreover, we find that a fully convolutional architecture leads to suboptimal solutions for guided upsampling. To render more realistic and coherent content, we design a novel module, named texture-aware guided attention, that jointly considers the procedures of texture copying and generation, and we introduce several important modifications to remove boundary artifacts.
Through extensive experiments, we find that the proposed method vastly outperforms state-of-the-art methods in four aspects: 1) a large boost in image fidelity, even compared with deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization on large masks and generic datasets such as ImageNet; and 4) much higher decoding efficiency than previous auto-regressive methods.
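The abstract describes TAPS only at a high level. The sketch below illustrates the general idea of temperature-controlled probabilistic sampling from a token model with an annealing schedule; the linear schedule, the top-k filtering, the `logits_fn` interface, and the toy logit table are all illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def taps_sample(logits_fn, seq_len, t_start=1.0, t_end=0.1, top_k=50, rng=None):
    """Illustrative temperature-annealed probabilistic sampling.

    logits_fn(tokens) -> logits over the vocabulary for the next position.
    The sampling temperature anneals from t_start to t_end, trading early
    diversity for late-stage sharpness of the sampled structure.
    """
    rng = rng or np.random.default_rng(0)
    tokens = []
    for i in range(seq_len):
        # Linear annealing schedule (an assumption; other schedules are possible).
        t = t_start + (t_end - t_start) * i / max(seq_len - 1, 1)
        logits = np.asarray(logits_fn(tokens), dtype=float)
        # Keep only the top-k logits to suppress low-probability tokens.
        if top_k < len(logits):
            cutoff = np.partition(logits, -top_k)[-top_k]
            logits = np.where(logits >= cutoff, logits, -np.inf)
        scaled = logits / t
        probs = np.exp(scaled - np.max(scaled))  # softmax with max-shift for stability
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

# Toy "model": a fixed random logit table over a 512-token vocabulary,
# standing in for a transformer's per-position predictions.
_rng = np.random.default_rng(1)
_table = _rng.normal(size=(32, 512))
sample = taps_sample(lambda toks: _table[len(toks)], seq_len=32)
```

In this sketch the speedup aspect of TAPS is not modeled; the point is only how a temperature schedule steers a probabilistic decoder between diversity and determinism.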
