CLIP-Flow: Decoding images encoded in CLIP space

IF 18.3 3区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Computational Visual Media Pub Date : 2024-08-28 DOI:10.1007/s41095-023-0375-z

Hao Ma, Ming Li, Jingyuan Yang, Or Patashnik, Dani Lischinski, Daniel Cohen-Or, Hui Huang

{"title":"CLIP-Flow: Decoding images encoded in CLIP space","authors":"Hao Ma, Ming Li, Jingyuan Yang, Or Patashnik, Dani Lischinski, Daniel Cohen-Or, Hui Huang","doi":"10.1007/s41095-023-0375-z","DOIUrl":null,"url":null,"abstract":"<p>This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such information. Moreover, to bridge the embedding space of CLIP and latent space of StyleGAN, real NVP is employed and modified with activation normalization and invertible convolution. As the images and text in CLIP share the same representation space, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested on the public dataset Multi-Modal CelebA-HQ, for text-to-image synthesis. Experiments validated that our approach can generate high-quality text-matching images, and is comparable with state-of-the-art methods, both qualitatively and quantitatively.\n</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"30 1","pages":""},"PeriodicalIF":18.3000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Visual Media","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s41095-023-0375-z","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such information. Moreover, to bridge the embedding space of CLIP and latent space of StyleGAN, real NVP is employed and modified with activation normalization and invertible convolution. As the images and text in CLIP share the same representation space, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested on the public dataset Multi-Modal CelebA-HQ, for text-to-image synthesis. Experiments validated that our approach can generate high-quality text-matching images, and is comparable with state-of-the-art methods, both qualitatively and quantitatively.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

CLIP-Flow：解码以 CLIP 空间编码的图像

本研究介绍了 CLIP-Flow，这是一种从给定图像或文本生成图像的新型网络。为了有效利用这两种模式所包含的丰富语义，我们设计了一种语义引导的图像和文本到图像合成方法。具体而言，我们采用对比语言-图像预训练（CLIP）作为编码器来提取语义，并采用 StyleGAN 作为解码器来根据这些信息生成图像。此外，为了连接 CLIP 的嵌入空间和 StyleGAN 的潜空间，我们采用了真实 NVP，并对其进行了激活归一化和反向卷积修改。由于 CLIP 中的图像和文本共享相同的表示空间，因此文本提示可以直接输入 CLIP-Flow，实现文本到图像的合成。我们在多个数据集上进行了大量实验，以验证所提出的图像到图像合成方法的有效性。此外，我们还在公共数据集 Multi-Modal CelebA-HQ 上进行了文本到图像合成的测试。实验验证了我们的方法可以生成高质量的文本匹配图像，并且在质量和数量上都可以与最先进的方法相媲美。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Computational Visual Media Computer Science-Computer Graphics and Computer-Aided Design

CiteScore

16.90

自引率

5.80%

发文量

243

审稿时长

6 weeks

期刊介绍： Computational Visual Media is a peer-reviewed open access journal. It publishes original high-quality research papers and significant review articles on novel ideas, methods, and systems relevant to visual media. Computational Visual Media publishes articles that focus on, but are not limited to, the following areas: • Editing and composition of visual media • Geometric computing for images and video • Geometry modeling and processing • Machine learning for visual media • Physically based animation • Realistic rendering • Recognition and understanding of visual media • Visual computing for robotics • Visualization and visual analytics Other interdisciplinary research into visual media that combines aspects of computer graphics, computer vision, image and video processing, geometric computing, and machine learning is also within the journal''s scope. This is an open access journal, published quarterly by Tsinghua University Press and Springer. The open access fees (article-processing charges) are fully sponsored by Tsinghua University, China. Authors can publish in the journal without any additional charges.

期刊最新文献

TrafPS: A shapley-based visual analytics approach to interpret traffic CLIP-Flow: Decoding images encoded in CLIP space CLIP-SP: Vision-language model with adaptive prompting for scene parsing SGformer: Boosting transformers for indoor lighting estimation from a single image Central similarity consistency hashing for asymmetric image retrieval