Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models

Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Joao Souza, Suhail Doshi, Daiqing Li

arXiv:2409.10695 (arXiv - CS - Graphics), published 2024-09-16. Citation count: 0.
Abstract
We introduce Playground v3 (PGv3), our latest text-to-image model that
achieves state-of-the-art (SoTA) performance across multiple testing
benchmarks, excels at graphic design, and introduces new capabilities.
Unlike traditional text-to-image generative models that rely on pre-trained
language models like T5 or CLIP text encoders, our approach fully integrates
Large Language Models (LLMs) with a novel structure that leverages text
conditions exclusively from a decoder-only LLM. Additionally, to enhance image
captioning quality, we developed an in-house captioner capable of generating
captions with varying levels of detail, enriching the diversity of text
structures. We also introduce CapsBench, a new benchmark to evaluate detailed
image captioning performance. Experimental results demonstrate that PGv3 excels
in text prompt adherence, complex reasoning, and accurate text rendering. User
preference studies indicate the super-human graphic design ability of our model
for common design applications, such as stickers, posters, and logo designs.
Furthermore, PGv3 introduces new capabilities, including precise RGB color
control and robust multilingual understanding.
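The core architectural idea in the abstract, replacing a T5/CLIP text encoder with per-token hidden states from a decoder-only LLM as the sole text condition, can be illustrated with a minimal single-head cross-attention sketch. This is not the paper's implementation: the function names, dimensions, and random stand-in tensors (in place of real LLM activations and image latents) are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(img_tokens, llm_hidden, w_proj, w_q, w_k, w_v):
    # Project LLM hidden states into the image-token width, then let image
    # tokens attend to them (single head; norms and masking omitted for brevity).
    cond = llm_hidden @ w_proj                  # (t_text, d_img) text conditions
    q, k, v = img_tokens @ w_q, cond @ w_k, cond @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (t_img, t_text) attention logits
    return img_tokens + softmax(scores) @ v    # residual update of image tokens

rng = np.random.default_rng(0)
d_img, d_llm, t_img, t_text = 64, 128, 16, 8
img = rng.standard_normal((t_img, d_img))      # stand-in latent image patches
llm = rng.standard_normal((t_text, d_llm))     # stand-in decoder-only LLM states
w = [rng.standard_normal(s) * 0.05 for s in
     [(d_llm, d_img), (d_img, d_img), (d_img, d_img), (d_img, d_img)]]
fused = cross_attend(img, llm, *w)
print(fused.shape)  # (16, 64)
```

The point of the sketch is only the conditioning pathway: the text signal enters the image branch as a sequence of LLM hidden states rather than as a pooled or encoder-produced embedding, which is what the abstract means by taking text conditions exclusively from a decoder-only LLM.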