Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions

João Victor Bruneti Severino, Pedro Angelo Basei de Paula, Matheus Nespolo Berger, Filipe Silveira Loures, Solano Amadori Todeschini, Eduardo Augusto Roeder, Maria Han Veiga, Murilo Guedes, Gustavo Lenci Marques

BMJ Health & Care Informatics, vol. 32, no. 1, published 2025-02-24. DOI: 10.1136/bmjhci-2024-101195
Abstract
Objective: The study aimed to evaluate top-performing large language models (LLMs) on validated medical knowledge tests in Portuguese.
Methods: This study compared 31 LLMs on Revalida, the Brazilian national medical licensing examination, evaluating the performance of 23 open-source and 8 proprietary models across 399 multiple-choice questions.
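The abstract does not describe the evaluation harness, so the following is a minimal, hypothetical sketch of how a success rate over multiple-choice questions of this kind is typically computed. The `Question` type, `build_prompt`, and the caller-supplied `query_model` are assumptions for illustration, not the authors' code.

```python
# Hypothetical scoring sketch -- not the authors' actual harness.
# Assumes a `query_model(prompt) -> str` callable supplied by the caller.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    stem: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str               # gold letter, e.g. "C"

def build_prompt(q: Question) -> str:
    opts = "\n".join(f"{letter}) {text}" for letter, text in q.options.items())
    # Instruction in Portuguese, since the benchmark questions are in Portuguese.
    return f"{q.stem}\n{opts}\nResponda apenas com a letra da alternativa correta."

def success_rate(questions: list[Question], query_model: Callable[[str], str]) -> float:
    """Fraction of questions where the model's first letter matches the key.
    Replies that do not start with a valid option letter count as wrong,
    which is one way a model can 'fail to provide coherent answers'."""
    correct = 0
    for q in questions:
        letter = query_model(build_prompt(q)).strip().upper()[:1]
        correct += letter == q.answer
    return correct / len(questions)
```

Under a scorer like this, a 53.9% success rate for Llama 3 8B would correspond to roughly 215 of the 399 questions answered correctly.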
Results: Among the smaller models, Llama 3 8B exhibited the highest success rate at 53.9%, while the medium-sized model Mixtral 8×7B attained 63.7%. Larger models such as Llama 3 70B reached 77.5%. Among the proprietary models, GPT-4o and Claude Opus demonstrated the highest accuracy, scoring 86.8% and 83.8%, respectively.
Conclusions: 10 of the 31 LLMs performed above the human level on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models exhibited superior performance overall; however, certain medium-sized LLMs surpassed some of the larger ones.