{"title":"While GPT-3.5 is unable to pass the Physician Licensing Exam in Taiwan, GPT-4 successfully meets the criteria.","authors":"Tsung-An Chen, Kuan-Chen Lin, Ming-Hwai Lin, Hsiao-Ting Chang, Yu-Chun Chen, Tzeng-Ji Chen","doi":"10.1097/JCMA.0000000000001225","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>This study investigates the performance of ChatGPT-3.5 and ChatGPT-4 in answering medical questions from Taiwan's Physician Licensing Exam, ranging from basic medical knowledge to specialized clinical topics. It aims to understand these artificial intelligence (AI) models' capabilities in a non-English context, specifically traditional Chinese.</p><p><strong>Methods: </strong>The study incorporated questions from the Taiwan Physician Licensing Exam in 2022, excluding image-based queries. Each question was manually input into ChatGPT, and responses were compared with official answers from Taiwan's Ministry of Examination. Differences across specialties and question types were assessed using the Kruskal-Wallis and Fisher's exact tests.</p><p><strong>Results: </strong>ChatGPT-3.5 achieved an average accuracy of 67.7% in basic medical sciences and 53.2% in clinical medicine. Meanwhile, ChatGPT-4 significantly outperformed ChatGPT-3.5, with average accuracies of 91.9% and 90.7%, respectively. ChatGPT-3.5 scored above 60.0% in 7 out of 10 basic medical science subjects and 3 out of 14 clinical subjects, while ChatGPT-4 scored above 60.0% in every subject. The type of question did not significantly affect accuracy rates.</p><p><strong>Conclusion: </strong>ChatGPT-3.5 showed proficiency in basic medical sciences but was less reliable in clinical medicine, whereas ChatGPT-4 demonstrated strong capabilities in both areas. However, their proficiency varied across different specialties. The type of question had minimal impact on performance. This study highlights the potential of AI models in medical education and non-English languages examination and the need for cautious and informed implementation in educational settings due to variability across specialties.</p>","PeriodicalId":94115,"journal":{"name":"Journal of the Chinese Medical Association : JCMA","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Chinese Medical Association : JCMA","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1097/JCMA.0000000000001225","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Background: This study investigates the performance of ChatGPT-3.5 and ChatGPT-4 in answering medical questions from Taiwan's Physician Licensing Exam, ranging from basic medical knowledge to specialized clinical topics. It aims to understand these artificial intelligence (AI) models' capabilities in a non-English context, specifically traditional Chinese.
Methods: The study incorporated questions from the Taiwan Physician Licensing Exam in 2022, excluding image-based queries. Each question was manually input into ChatGPT, and responses were compared with official answers from Taiwan's Ministry of Examination. Differences across specialties and question types were assessed using the Kruskal-Wallis and Fisher's exact tests.
Results: ChatGPT-3.5 achieved an average accuracy of 67.7% in basic medical sciences and 53.2% in clinical medicine. Meanwhile, ChatGPT-4 significantly outperformed ChatGPT-3.5, with average accuracies of 91.9% and 90.7%, respectively. ChatGPT-3.5 scored above 60.0% in 7 out of 10 basic medical science subjects and 3 out of 14 clinical subjects, while ChatGPT-4 scored above 60.0% in every subject. The type of question did not significantly affect accuracy rates.
Conclusion: ChatGPT-3.5 showed proficiency in basic medical sciences but was less reliable in clinical medicine, whereas ChatGPT-4 demonstrated strong capabilities in both areas. However, their proficiency varied across different specialties. The type of question had minimal impact on performance. This study highlights the potential of AI models in medical education and non-English languages examination and the need for cautious and informed implementation in educational settings due to variability across specialties.