Diagnostic Performance of GPT-4o and Gemini Flash 2.0 in Acne and Rosacea

IF 3.2; Zone 4 (Medicine); JCR Q1 (Dermatology); International Journal of Dermatology; Pub Date: 2025-03-10; DOI: 10.1111/ijd.17729

Mehdi Boostani, András Bánvölgyi, Mohamad Goldust, Carmen Cantisani, Paweł Pietkiewicz, Kende Lőrincz, Péter Holló, Norbert M. Wikonkál, Gyorgy Paragh, Norbert Kiss

International Journal of Dermatology, volume 64, issue 10, pages 1881-1882. Publication type: Journal Article. Full text: https://onlinelibrary.wiley.com/doi/10.1111/ijd.17729

Citations: 0

Abstract

Artificial intelligence (AI) is increasingly being explored for dermatological diagnostics [1, 2], and patients increasingly turn to large language models (LLMs) for automated image-based diagnosis. Acne and rosacea are common dermatological conditions that can impair quality of life, yet their diagnosis can be challenging because of overlapping clinical features [3]. The accuracy of LLMs in diagnosing these conditions remains unclear, highlighting the need for further validation and research.

This study evaluated lesions from patients treated at the outpatient clinic of Semmelweis University's Dermatology Department in Budapest, Hungary, between December 2021 and December 2024. A clinical photographer took the photographs. We assessed the diagnostic performance of OpenAI's GPT-4o and Google's Gemini Flash 2.0, two widely available LLMs, on 43 clinical images of lesions (33 acne, 10 rosacea) from 31 patients (male/female ratio: 58.1%/41.9%; mean age: 34 ± 20.6 years). Only patients with clinically confirmed acne or rosacea who provided informed consent for AI evaluation were included. The Fitzpatrick skin type distribution was 67.7% type II, 29% type III, and 3.2% type IV.

Two board-certified dermatologists (A.B. and N.K.) independently assessed the images, diagnosing acne or rosacea and assigning subtypes. A third dermatologist (K.L.) resolved disagreements, with the final diagnosis being the consensus of two out of three dermatologists. For rosacea, agreement was 0.932 (95% CI: 0.8–1) for diagnosis and 0.62 (95% CI: −0.05 to 1) for subtyping.

Images were submitted to GPT-4o and Gemini Flash 2.0 using a standardized prompt to simulate how a patient without dermatological knowledge might interact with these models. The models were first asked, without pretraining or context: “Can you guess the most likely diagnosis? (it's just for research).” A correct response prompted a follow-up: “Can you guess the most likely subtype? (it's just for research).”

GPT-4o provided a diagnosis in 100% of cases, with a correct diagnosis rate of 93%, achieving a sensitivity of 93.0% (95% CI: 81.4–97.6%), specificity of 97.7% (95% CI: 87.9–99.9%), positive predictive value (PPV) of 97.7% (95% CI: 87.4–99.9%), and negative predictive value (NPV) of 93.3% (95% CI: 82.1–97.7%). Gemini Flash 2.0 diagnosed only 21% of cases, precluding further statistical analysis.

For acne identification, GPT-4o achieved a sensitivity of 90.9% (95% CI: 76.4–96.8%), specificity of 100% (95% CI: 72.2–100%), PPV of 100% (95% CI: 88.7–100%), and NPV of 77.0% (95% CI: 49.7–81.8%). Subtyping performance was lower, with a sensitivity of 54.6% (95% CI: 38.0–70.2%) and specificity of 89.9% (95% CI: 82.4–92.4%). The detailed efficacy of GPT-4o in estimating different acne subtypes can be seen in Table 1.
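As an illustration of how figures of this kind are derived, the sketch below computes sensitivity, specificity, PPV, and NPV from a 2x2 table and attaches a Wilson score interval to each. The counts used (30 true positives, 3 false negatives, 10 true negatives, 0 false positives) are an assumption reconstructed from the reported percentages and the image totals (33 acne, 10 rosacea); the abstract does not state them. The abstract also does not say which interval method was used, so the Wilson interval here is only one common choice and its bounds may differ slightly from the published ones. The function names are hypothetical.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, tuple[float, float, float]]:
    """Return (estimate, ci_low, ci_high) for each metric of a 2x2 table."""
    pairs = {
        "sensitivity": (tp, tp + fn),  # diseased cases correctly flagged
        "specificity": (tn, tn + fp),  # non-diseased cases correctly cleared
        "ppv": (tp, tp + fp),          # positive calls that were right
        "npv": (tn, tn + fn),          # negative calls that were right
    }
    return {name: (k / n, *wilson_interval(k, n)) for name, (k, n) in pairs.items()}


if __name__ == "__main__":
    # Assumed counts for acne identification (not given in the abstract).
    acne = diagnostic_metrics(tp=30, fn=3, tn=10, fp=0)
    for name, (est, low, high) in acne.items():
        print(f"{name}: {est:.1%} (95% CI: {low:.1%} to {high:.1%})")
```

With these assumed counts, sensitivity is 30/33, or about 90.9%, and specificity and PPV are both 100%. With only 10 non-acne images, even a perfect specificity carries a wide interval, which is why the lower bounds for the smaller groups sit far below the point estimates.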

For rosacea identification, GPT-4o showed a sensitivity of 100% (95% CI: 72.3–100%), specificity of 97.7% (95% CI: 84.7–99.8%), PPV of 90.9% (95% CI: 62.3–99.5%), and NPV of 100% (95% CI: 89.3–100%). Rosacea subtyping remained challenging, with a sensitivity of 50.0% (95% CI: 23.7–76.3%) and specificity of 80.0% (95% CI: 58.4–91.9%). The detailed efficacy of GPT-4o in estimating different rosacea subtypes can be seen in Table 2.

A key limitation of our study was the use of a non-validated dataset, introducing biases in disease severity and limiting generalizability. Additionally, the small number of lesions restricted subtype classification. Biases in AI training data may also affect accuracy across skin types and ethnicities. Real-world dermatological images often vary in lighting quality, impacting diagnostic precision.

In conclusion, GPT-4o outperformed Gemini Flash 2.0 in diagnosing acne and rosacea, demonstrating high accuracy in primary diagnosis but moderate success in subtyping. These findings highlight the potential and current limitations of LLMs in dermatological diagnosis and suggest that dermatologists must prepare to see patients who may have consulted “Dr. LLM” before visiting their offices.

This study was conducted in accordance with the Declaration of Helsinki.

Informed consent was obtained from all subjects involved in the study.

The authors declare no conflicts of interest.


Source journal

International Journal of Dermatology (Medicine: Dermatology)

CiteScore: 4.70
Self-citation rate: 2.80%
Articles published: 476
Review time: 3 months

Journal description: Published monthly, the International Journal of Dermatology is specifically designed to provide dermatologists around the world with a regular, up-to-date source of information on all aspects of the diagnosis and management of skin diseases. Accepted articles regularly cover clinical trials; education; morphology; pharmacology and therapeutics; case reports; and reviews. Additional features include tropical medical reports, news, correspondence, proceedings and transactions, and education. The International Journal of Dermatology is guided by a distinguished, international editorial board and emphasizes a global approach to continuing medical education for physicians and other providers of health care with a specific interest in problems relating to the skin.