Implicit and Explicit Language Guidance for Diffusion-Based Visual Perception

Hefeng Wang; Jiale Cao; Jin Xie; Aiping Yang; Yanwei Pang

IEEE Transactions on Multimedia, vol. 27, pp. 466-476. DOI: 10.1109/TMM.2024.3521825. Published: 24 December 2024.
Text-to-image diffusion models have shown a powerful ability in conditional image synthesis. With large-scale vision-language pre-training, diffusion models can generate high-quality images with rich textures and reasonable structures under different text prompts. However, adapting pre-trained diffusion models for visual perception remains an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based visual perception, named IEDP. Our IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model without explicit text prompts. The explicit branch uses the ground-truth labels of the corresponding images as text prompts to condition feature extraction in the diffusion model. During training, we jointly train the diffusion model by sharing the model weights of the two branches. As a result, the implicit and explicit branches can jointly guide feature learning. During inference, we employ only the implicit branch for the final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks: semantic segmentation and depth estimation. Our IEDP achieves promising performance on both tasks. For semantic segmentation, IEDP achieves an mIoU$^\text{ss}$ score of 55.9% on the ADE20K validation set, outperforming the baseline method VPD by 2.2%. For depth estimation, IEDP outperforms the baseline VPD with a relative gain of 11.0%.
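The key design point in the abstract is the training/inference asymmetry: two conditioning branches share one diffusion backbone during training, but only the label-free implicit branch runs at inference. The following is a minimal PyTorch sketch of that control flow only. All module names (`IEDPSketch`, `clip_image_encoder`, `label_text_encoder`, `diffusion_backbone`, `task_head`), shapes, and the stand-in linear layers are placeholder assumptions for illustration; the paper itself conditions a pre-trained Stable Diffusion UNet with real CLIP encoders, not these stubs.

```python
import torch
import torch.nn as nn


class IEDPSketch(nn.Module):
    """Hypothetical sketch of IEDP's dual-branch weight sharing.

    Both branches condition the SAME backbone, so gradients from either
    branch update the same weights; inference needs no ground-truth labels.
    """

    def __init__(self, embed_dim: int = 768, num_classes: int = 150):
        super().__init__()
        # Stand-in for the frozen CLIP image encoder (implicit branch).
        self.clip_image_encoder = nn.Linear(3 * 32 * 32, embed_dim)
        for p in self.clip_image_encoder.parameters():
            p.requires_grad = False  # frozen, as in the abstract
        # Stand-in for text embeddings of ground-truth class labels
        # (explicit branch); the paper builds text prompts from labels.
        self.label_text_encoder = nn.Embedding(num_classes, embed_dim)
        # Stand-in for the shared diffusion backbone.
        self.diffusion_backbone = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU()
        )
        # Stand-in task head (segmentation logits or depth values).
        self.task_head = nn.Linear(embed_dim, num_classes)

    def _run_shared(self, cond: torch.Tensor) -> torch.Tensor:
        # Single code path for both branches: this is the weight sharing.
        return self.task_head(self.diffusion_backbone(cond))

    def forward(self, images: torch.Tensor, gt_labels: torch.Tensor = None):
        # Implicit branch: image-derived embedding, no text prompt needed.
        implicit_cond = self.clip_image_encoder(images.flatten(1))
        implicit_out = self._run_shared(implicit_cond)
        if self.training and gt_labels is not None:
            # Explicit branch: ground-truth labels act as text prompts.
            explicit_cond = self.label_text_encoder(gt_labels)
            explicit_out = self._run_shared(explicit_cond)
            return implicit_out, explicit_out
        # Inference: implicit branch only, so no labels are required.
        return implicit_out


model = IEDPSketch()
images = torch.randn(2, 3, 32, 32)
labels = torch.randint(0, 150, (2,))

model.train()
implicit_pred, explicit_pred = model(images, labels)  # joint training

model.eval()
pred = model(images)  # label-free prediction at inference
```

Under these assumptions, the shared `_run_shared` path is what lets the explicit branch's label-derived supervision shape the backbone that the implicit branch later uses alone.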
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.