T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation

IF 18.6, IEEE Transactions on Pattern Analysis and Machine Intelligence, Pub Date: 2025-01-20, DOI: 10.1109/TPAMI.2025.3531907

Kaiyi Huang;Chengqi Duan;Kaiyue Sun;Enze Xie;Zhenguo Li;Xihui Liu

Volume 47, Issue 5, pp. 3563-3579. Publication type: Journal Article. Full text: https://ieeexplore.ieee.org/document/10847875/

Citations: 0

Abstract



Despite the impressive advances in text-to-image models, they often struggle to effectively compose complex scenes with multiple objects, displaying various attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions. These are further divided into eight sub-categories, including newly introduced ones like 3D-spatial relationships and numeracy. In addition to the benchmark, we propose enhanced evaluation metrics designed to assess these diverse compositional challenges. These include a detection-based metric tailored for evaluating 3D-spatial relationships and numeracy, and an analysis leveraging Multimodal Large Language Models (MLLMs), i.e., GPT-4V and ShareGPT4V, as evaluation metrics. Our experiments benchmark 11 text-to-image models, including state-of-the-art models such as FLUX.1, SD3, DALLE-3, Pixart-$\alpha$, and SD-XL, on T2I-CompBench++. We also conduct comprehensive evaluations to validate the effectiveness of our metrics and explore the potential and limitations of MLLMs.
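The abstract names a detection-based metric for numeracy but does not say how it is computed. Purely as an illustration of the general idea, the sketch below compares the object counts requested by a prompt with the labels returned by an object detector. The function name `numeracy_score`, the exact-match scoring rule, and the example labels are assumptions made here for illustration; they are not the paper's metric or its code.

```python
from collections import Counter


def numeracy_score(expected: dict[str, int], detected_labels: list[str]) -> float:
    """Fraction of prompted object classes whose detected count matches exactly.

    `expected` maps each class named in the prompt to the requested count.
    `detected_labels` is a flat list of class labels, one per detected object,
    standing in for the output of any object detector (hypothetical input).
    """
    if not expected:
        raise ValueError("expected must name at least one object class")
    counts = Counter(detected_labels)
    # Counter returns 0 for classes the detector never reported.
    matches = sum(1 for name, n in expected.items() if counts[name] == n)
    return matches / len(expected)


# Prompt: "three apples and two dogs"; the detector found only one dog.
expected = {"apple": 3, "dog": 2}
detected = ["apple", "apple", "apple", "dog"]
print(numeracy_score(expected, detected))  # 0.5
```

Exact matching is the strictest possible rule; a softer variant could give partial credit when the detected count is close to the requested one.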


Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Self-citation rate

0.00%

Publication count