Evaluating Text-to-Image Generated Photorealistic Images of Human Anatomy

Paula Muhr, Yating Pan, Charlotte Tumescheit, Ann-Kathrin Kuebler, Hatice Kuebra Parmaksiz, Cheng Chen, Pablo Sebastian Bolanos Orozco, Soeren S. Lienkamp, Janna Hastings
medRxiv - Health Informatics. Published 2024-08-21. DOI: 10.1101/2024.08.21.24312353 (https://doi.org/10.1101/2024.08.21.24312353)
Citations: 0

Abstract

Background: Generative AI models that can produce photorealistic images from text descriptions have many applications in medicine, including medical education and synthetic data. However, it can be challenging to evaluate and compare their range of heterogeneous outputs, and thus there is a need for a systematic approach enabling image and model comparisons.
Methods: We develop an error classification system for annotating errors in AI-generated photorealistic images of humans and apply our method to a corpus of 240 images generated with three different models (DALL-E 3, Stable Diffusion XL and Stable Cascade) using 10 prompts with 8 images per prompt. The error classification system identifies five different error types with three different severities across five anatomical regions and specifies an associated quantitative scoring method based on aggregated proportions of errors per expected count of anatomical components for the generated image. We assess inter-rater agreement by double-annotating 25% of the images and calculating Krippendorff's alpha, and compare results across the three models and ten prompts quantitatively using a cumulative score per image.
Findings: The error classification system, accompanying training manual, generated image collection, annotations, and all associated scripts are available from our GitHub repository at https://github.com/hastingslab-org/ai-human-images. Inter-rater agreement was relatively poor, reflecting the subjectivity of the error classification task. Model comparisons revealed DALL-E 3 performed consistently better than Stable Diffusion; however, the latter generated images reflecting more diversity in personal attributes. Images with groups of people were more challenging for all the models than individuals or pairs; some prompts were challenging for all models.
Interpretation: Our method enables systematic comparison of AI-generated photorealistic images of humans; our results can serve to catalyse improvements in these models for medical applications.
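
The two quantitative steps named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation (their scripts are in the GitHub repository cited above). The severity names and weights, the region labels, and both function names are assumptions made for illustration, and the alpha shown is the nominal-data variant, which the abstract does not confirm was the one used.

```python
from collections import Counter
from itertools import permutations

# Hypothetical severity weights; the abstract only states that there are three severities.
SEVERITY_WEIGHT = {"minor": 1, "moderate": 2, "severe": 3}


def image_score(errors, expected_counts):
    """Cumulative score for one image (illustrative definition).

    errors: list of (region, severity) tuples annotated for the image.
    expected_counts: dict mapping region -> number of anatomical components
    expected in that region for this image.
    Errors in regions missing from expected_counts are ignored.
    """
    weighted = Counter()
    for region, severity in errors:
        weighted[region] += SEVERITY_WEIGHT[severity]
    # Proportion of weighted errors per expected component, summed over regions.
    return sum(
        weighted[region] / count
        for region, count in expected_counts.items()
        if count > 0
    )


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the raters
    assigned to one item (for example, one double-annotated image).
    """
    # Build the coincidence matrix from all ordered label pairs within an item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items rated only once carry no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label ever used: no disagreement is possible
    return 1 - observed / expected
```

Under these assumed weights, lower image scores mean fewer or milder errors relative to how many components the image should contain. An alpha near 1 indicates strong agreement between annotators, while values near 0 indicate agreement no better than chance.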