From text to mask: Localizing entities using the attention of text-to-image diffusion models

Neurocomputing · IF 6.5 · CAS Tier 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence) · Pub Date: 2024-08-22 · DOI: 10.1016/j.neucom.2024.128437
Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang
{"title":"From text to mask: Localizing entities using the attention of text-to-image diffusion models","authors":"Changming Xiao ,&nbsp;Qi Yang ,&nbsp;Feng Zhou ,&nbsp;Changshui Zhang","doi":"10.1016/j.neucom.2024.128437","DOIUrl":null,"url":null,"abstract":"<div><p>Diffusion models have revolted the field of text-to-image generation recently. The unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another perspective, these generative models imply clues about the precise correlation between words and pixels. This work proposes a simple but effective method to utilize the attention mechanism in the denoising network of text-to-image diffusion models. Without additional training time nor inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under weakly-supervised semantic segmentation setting and our method achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is generalizable for the learned text embedding of customized generation methods, requiring only a few modifications. To validate our discovery, we introduce a new practical task called “personalized referring image segmentation” with a new dataset. Experiments in various situations demonstrate the advantages of our method compared to strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.</p></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"610 ","pages":"Article 128437"},"PeriodicalIF":6.5000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224012086","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Diffusion models have recently revolutionized the field of text-to-image generation. Their unique way of fusing text and image information underlies their remarkable ability to generate images that closely match the input text. From another perspective, these generative models carry clues about the precise correlation between words and pixels. This work proposes a simple but effective method that exploits the attention mechanism in the denoising network of text-to-image diffusion models. Without additional training or inference-time optimization, the semantic grounding of phrases can be obtained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly supervised semantic segmentation setting, where it outperforms prior methods. In addition, the acquired word-pixel correlation generalizes, with only a few modifications, to the learned text embeddings of customized generation methods. To validate this finding, we introduce a new practical task called "personalized referring image segmentation" together with a new dataset. Experiments in various situations demonstrate the advantages of our method over strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
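The core idea described above, reading a word-pixel correlation off the cross-attention of the denoising UNet and thresholding it into a mask, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the cross-attention probabilities have already been captured from a Stable-Diffusion-style UNet (for example via custom attention processors in Hugging Face diffusers), and the function name, tensor shapes, and threshold value are illustrative assumptions; the paper's actual choice of layers, timesteps, and aggregation may differ.

```python
# Minimal sketch (not the authors' code): turn a captured cross-attention map
# into a binary mask for a single prompt token. `attn_probs` is assumed to have
# shape (num_heads, num_latent_pixels, num_text_tokens).
import torch
import torch.nn.functional as F


def token_attention_to_mask(attn_probs: torch.Tensor,
                            token_index: int,
                            latent_hw: tuple,
                            image_hw: tuple,
                            threshold: float = 0.5) -> torch.Tensor:
    """Average one token's cross-attention over heads, upsample it to the image
    resolution, min-max normalize, and threshold into a binary mask."""
    h, w = latent_hw
    # Select the attention column for the target token and average over heads.
    token_map = attn_probs[:, :, token_index].mean(dim=0)          # (h*w,)
    token_map = token_map.reshape(1, 1, h, w)                      # (1, 1, h, w)
    # Upsample the low-resolution attention map to the output image size.
    token_map = F.interpolate(token_map, size=image_hw,
                              mode="bilinear", align_corners=False)[0, 0]
    # Min-max normalize so the threshold is comparable across layers/timesteps.
    token_map = (token_map - token_map.min()) / (token_map.max() - token_map.min() + 1e-8)
    return (token_map > threshold).to(torch.uint8)                 # (H, W) binary mask


# Toy usage: 8 heads, a 16x16 latent grid, 77 text tokens (CLIP's padded length).
attn = torch.rand(8, 16 * 16, 77)
mask = token_attention_to_mask(attn, token_index=5,
                               latent_hw=(16, 16), image_hw=(512, 512))
print(mask.shape, mask.float().mean())  # mask size and foreground ratio
```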

Source journal
Neurocomputing (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles per year: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.
Latest articles in this journal
Editorial Board
Dismantling strategies for cost networks based on multi-view deep learning
Contrastive coarse-to-fine medical segmentation with prototype guidance and dual-granularity fusion
LECMARL: A cooperative multi-agent reinforcement learning method based on lazy mechanisms and efficient exploration
Offset-corrected query generation strategies for cross-modality misalignment in 3D object detection: aligning LiDAR and camera