Diagnostic Performance of GPT-4o and Gemini Flash 2.0 in Acne and Rosacea
Mehdi Boostani, András Bánvölgyi, Mohamad Goldust, Carmen Cantisani, Paweł Pietkiewicz, Kende Lőrincz, Péter Holló, Norbert M. Wikonkál, Gyorgy Paragh, Norbert Kiss
International Journal of Dermatology, vol. 64, no. 10, pp. 1881-1882. Published 2025-03-10. DOI: 10.1111/ijd.17729
Impact factor: 3.2. JCR: Q1 (Dermatology).
https://onlinelibrary.wiley.com/doi/10.1111/ijd.17729
Citations: 0
Abstract
Artificial intelligence (AI) is increasingly being explored for dermatological diagnostics [1, 2], and patients are also turning to large language models (LLMs) for automated image-based diagnosis. Acne and rosacea are common dermatological conditions that can impair quality of life, yet their diagnosis can be challenging because of overlapping clinical features [3]. The accuracy of LLMs in diagnosing these conditions remains unclear, highlighting the need for further validation and research.
This study evaluated lesions from patients treated at the outpatient clinic of Semmelweis University's Dermatology Department in Budapest, Hungary, between December 2021 and December 2024. Using photographs taken by a clinical photographer, we assessed the diagnostic performance of OpenAI's GPT-4o and Google's Gemini Flash 2.0, two widely available LLMs, on 43 clinical images of lesions (33 acne, 10 rosacea) from 31 patients (male/female ratio: 58.1%/41.9%; mean age: 34 ± 20.6 years). Only patients with clinically confirmed acne or rosacea who provided informed consent for AI evaluation were included. Two board-certified dermatologists (A.B. and N.K.) independently assessed the images, diagnosing acne or rosacea and assigning subtypes. A third dermatologist (K.L.) resolved disagreements, with the final diagnosis being the consensus of two out of three dermatologists. The Fitzpatrick skin type distribution was 67.7% type II, 29% type III, and 3.2% type IV. For rosacea, agreement was 0.932 (95% CI: 0.8–1) for diagnosis and 0.62 (95% CI: −0.05 to 1) for subtyping.
Images were submitted to GPT-4o and Gemini Flash 2.0 using a standardized prompt to simulate how a patient without dermatological knowledge might interact with these models. The models were first asked, without pretraining or context: “Can you guess the most likely diagnosis? (it's just for research).” A correct response prompted a follow-up: “Can you guess the most likely subtype? (it's just for research).”
GPT-4o provided a diagnosis in 100% of cases, with a correct diagnosis rate of 93%, achieving a sensitivity of 93.0% (95% CI: 81.4–97.6%), specificity of 97.7% (95% CI: 87.9–99.9%), positive predictive value (PPV) of 97.7% (95% CI: 87.4–99.9%), and negative predictive value (NPV) of 93.3% (95% CI: 82.1–97.7%). Gemini Flash 2.0 diagnosed only 21% of cases, precluding further statistical analysis.
For acne identification, GPT-4o achieved a sensitivity of 90.9% (95% CI: 76.4–96.8%), specificity of 100% (95% CI: 72.2–100%), PPV of 100% (95% CI: 88.7–100%), and NPV of 77.0% (95% CI: 49.7–81.8%). Subtyping performance was lower, with a sensitivity of 54.6% (95% CI: 38.0–70.2%) and specificity of 89.9% (95% CI: 82.4–92.4%). The detailed efficacy of GPT-4o in estimating different acne subtypes can be seen in Table 1.
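As a worked illustration of how figures of this kind are derived, the sketch below computes sensitivity, specificity, PPV, and NPV from a 2×2 table and attaches a Wilson score interval to each. The abstract does not report the underlying counts or name the interval method, so both are assumptions here: the counts (30 true positives, 3 false negatives, 10 true negatives, 0 false positives) are merely one table consistent with 33 acne and 10 rosacea images, and `wilson_interval` and `diagnostic_counts` are hypothetical helper names.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


def diagnostic_counts(tp, fn, tn, fp):
    """Return (numerator, denominator) pairs for the four standard metrics."""
    return {
        "sensitivity": (tp, tp + fn),  # true positives / all diseased
        "specificity": (tn, tn + fp),  # true negatives / all non-diseased
        "ppv": (tp, tp + fp),          # true positives / all test-positive
        "npv": (tn, tn + fn),          # true negatives / all test-negative
    }


# ASSUMED counts, not taken from the paper: acne treated as the positive class.
metrics = diagnostic_counts(tp=30, fn=3, tn=10, fp=0)

for name, (k, n) in metrics.items():
    low, high = wilson_interval(k, n)
    print(f"{name}: {100 * k / n:.1f}% (95% CI: {100 * low:.1f}-{100 * high:.1f}%)")
```

Under these assumed counts the point estimates should land close to those reported for acne identification; small differences in the interval bounds are to be expected if the authors used a different interval method or rounding.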
For rosacea identification, GPT-4o showed a sensitivity of 100% (95% CI: 72.3–100%), specificity of 97.7% (95% CI: 84.7–99.8%), PPV of 90.9% (95% CI: 62.3–99.5%), and NPV of 100% (95% CI: 89.3–100%). Rosacea subtyping remained challenging, with a sensitivity of 50.0% (95% CI: 23.7–76.3%) and specificity of 80.0% (95% CI: 58.4–91.9%). The detailed efficacy of GPT-4o in estimating different rosacea subtypes can be seen in Table 2.
A key limitation of our study was the use of a non-validated dataset, introducing biases in disease severity and limiting generalizability. Additionally, the small number of lesions restricted subtype classification. Biases in AI training data may also affect accuracy across skin types and ethnicities. Real-world dermatological images often vary in lighting quality, impacting diagnostic precision.
In conclusion, GPT-4o outperformed Gemini Flash 2.0 in diagnosing acne and rosacea, demonstrating high accuracy in primary diagnosis but moderate success in subtyping. These findings highlight the potential and current limitations of LLMs in dermatological diagnosis and suggest that dermatologists must prepare to see patients who may have consulted “Dr. LLM” before visiting their offices.
This study was conducted in accordance with the Declaration of Helsinki.
Informed consent was obtained from all subjects involved in the study.
The authors declare no conflicts of interest.
About the journal:
Published monthly, the International Journal of Dermatology is specifically designed to provide dermatologists around the world with a regular, up-to-date source of information on all aspects of the diagnosis and management of skin diseases. Accepted articles regularly cover clinical trials; education; morphology; pharmacology and therapeutics; case reports; and reviews. Additional features include tropical medical reports, news, correspondence, proceedings and transactions, and education.
The International Journal of Dermatology is guided by a distinguished, international editorial board and emphasizes a global approach to continuing medical education for physicians and other providers of health care with a specific interest in problems relating to the skin.