Gender and Ethnicity Bias of Text-to-Image Generative Artificial Intelligence in Medical Imaging, Part 1: Preliminary Evaluation
Geoffrey Currie, Johnathan Hewis, Elizabeth Hawk, Eric Rohren
Journal of Nuclear Medicine Technology, published 2024-10-22
DOI: 10.2967/jnmt.124.268332 (https://doi.org/10.2967/jnmt.124.268332)
Abstract
Generative artificial intelligence (AI) text-to-image production could reinforce or amplify gender and ethnicity biases. Several text-to-image generative AI tools are used to produce images that represent the medical imaging professions. White male stereotyping and masculine cultures can dissuade women and ethnically diverse people from entering a profession.

Methods: In March 2024, DALL-E 3, Firefly 2, Stable Diffusion 2.1, and Midjourney 5.2 were used to generate a series of individual and group images of medical imaging professionals: radiologist, nuclear medicine physician, radiographer, and nuclear medicine technologist. Multiple iterations of images were generated using a variety of prompts. In total, 184 images containing 391 characters were produced for evaluation. All images were independently analyzed by 3 reviewers for apparent gender and skin tone.

Results: Across all individual and group characters (n = 391), 60.6% were male and 87.7% had a light skin tone. DALL-E 3 (65.6%), Midjourney 5.2 (76.7%), and Stable Diffusion 2.1 (56.2%) had a significantly higher representation of men than Firefly 2 (42.9%) (P < 0.0001). With Firefly 2, 70.3% of characters had light skin tones, which was significantly lower (P < 0.0001) than for Stable Diffusion 2.1 (84.8%), Midjourney 5.2 (100%), and DALL-E 3 (94.8%). Overall, image quality metrics were average or better in 87.2% for DALL-E 3 and 86.2% for Midjourney 5.2, whereas 50.9% were inadequate or poor for Firefly 2 and 86.0% for Stable Diffusion 2.1.

Conclusion: Text-to-image generation using DALL-E 3 via GPT-4 had the best overall quality compared with Firefly 2, Midjourney 5.2, and Stable Diffusion 2.1. Nonetheless, DALL-E 3 includes inherent biases associated with gender and ethnicity that demand more critical evaluation.
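As a reading aid, the overall percentages can be turned back into approximate character counts and given a rough uncertainty range. The Python sketch below is not the authors' analysis: the abstract gives only percentages and n = 391, so the counts are back-computed by rounding. The Wilson score interval is an illustrative choice made here, and it treats characters as independent, which may not hold for characters drawn from the same group image. The function names `approx_count` and `wilson_interval` are hypothetical helpers.

```python
import math

N_CHARACTERS = 391  # total characters evaluated, as reported in the abstract


def approx_count(percent: float, n: int) -> int:
    """Back-compute an approximate count from a reported percentage."""
    return round(percent / 100 * n)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Reported overall percentages; the counts are approximations, not published values.
male = approx_count(60.6, N_CHARACTERS)
light = approx_count(87.7, N_CHARACTERS)

for label, count in [("male", male), ("light skin tone", light)]:
    low, high = wilson_interval(count, N_CHARACTERS)
    print(f"{label}: ~{count}/{N_CHARACTERS}, approx. 95% interval {low:.3f} to {high:.3f}")
```

The per-tool comparisons cannot be reproduced this way, because the abstract does not report how many characters each tool generated.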