Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions.

BMJ Health & Care Informatics (IF 4.1, Q1 in Health Care Sciences & Services) · Publication date: 2025-02-24 · DOI: 10.1136/bmjhci-2024-101195
João Victor Bruneti Severino, Pedro Angelo Basei de Paula, Matheus Nespolo Berger, Filipe Silveira Loures, Solano Amadori Todeschini, Eduardo Augusto Roeder, Maria Han Veiga, Murilo Guedes, Gustavo Lenci Marques
{"title":"Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions.","authors":"João Victor Bruneti Severino, Pedro Angelo Basei de Paula, Matheus Nespolo Berger, Filipe Silveira Loures, Solano Amadori Todeschini, Eduardo Augusto Roeder, Maria Han Veiga, Murilo Guedes, Gustavo Lenci Marques","doi":"10.1136/bmjhci-2024-101195","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>The study aimed to evaluate the top large language models (LLMs) in validated medical knowledge tests in Portuguese.</p><p><strong>Methods: </strong>This study compared 31 LLMs in the context of solving the national Brazilian medical examination test. The research compared the performance of 23 open-source and 8 proprietary models across 399 multiple-choice questions.</p><p><strong>Results: </strong>Among the smaller models, Llama 3 8B exhibited the highest success rate, achieving 53.9%, while the medium-sized model Mixtral 8×7B attained a success rate of 63.7%. Conversely, larger models like Llama 3 70B achieved a success rate of 77.5%. Among the proprietary models, GPT-4o and Claude Opus demonstrated superior accuracy, scoring 86.8% and 83.8%, respectively.</p><p><strong>Conclusions: </strong>10 out of the 31 LLMs attained better than human level of performance in the Revalida benchmark, with 9 failing to provide coherent answers to the task. Larger models exhibited superior performance overall. However, certain medium-sized LLMs surpassed the performance of some of the larger LLMs.</p>","PeriodicalId":9050,"journal":{"name":"BMJ Health & Care Informatics","volume":"32 1","pages":""},"PeriodicalIF":4.1000,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMJ Health & Care Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1136/bmjhci-2024-101195","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Objective: This study aimed to evaluate leading large language models (LLMs) on a validated medical knowledge test in Portuguese.

Methods: This study compared 31 LLMs on the Brazilian national medical examination (Revalida), assessing the performance of 23 open-source and 8 proprietary models across 399 multiple-choice questions.
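
The abstract does not describe the evaluation harness itself. As a rough illustration only, a multiple-choice benchmark of this kind is typically scored with a loop like the Python sketch below, where ask_model is a hypothetical stand-in for whatever API call or local inference each of the 31 models requires; it is not the study's actual code:

    # Minimal sketch of multiple-choice benchmark scoring (illustrative only;
    # `ask_model` is a hypothetical stand-in, not the study's actual harness).
    def evaluate(ask_model, questions):
        """Return the fraction of questions answered correctly.

        Each question is a dict such as:
        {"prompt": "...", "options": {"A": "...", "B": "...", ...}, "answer": "C"}
        """
        correct = 0
        for q in questions:
            prediction = ask_model(q["prompt"], q["options"])  # e.g. "A".."E"
            if prediction == q["answer"]:
                correct += 1
        return correct / len(questions)

Under this kind of scoring, a model that never returns a valid answer letter scores zero, which may correspond to how the 9 models reported below as failing to provide coherent answers were identified.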

Results: Among the smaller models, Llama 3 8B exhibited the highest success rate, achieving 53.9%, while the medium-sized model Mixtral 8×7B attained a success rate of 63.7%. Larger models such as Llama 3 70B reached 77.5%. Among the proprietary models, GPT-4o and Claude Opus demonstrated the highest accuracy, scoring 86.8% and 83.8%, respectively.
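
For context on the 399-question test, these success rates correspond to roughly 215 correct answers for Llama 3 8B (0.539 × 399 ≈ 215), 254 for Mixtral 8×7B, 309 for Llama 3 70B, 346 for GPT-4o and 334 for Claude Opus.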

Conclusions: 10 of the 31 LLMs attained better-than-human performance on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models exhibited superior performance overall; however, certain medium-sized LLMs surpassed some of the larger ones.

Source journal: BMJ Health & Care Informatics · CiteScore 6.10 · Self-citation rate 4.90% · Annual publications 40 · Review time 18 weeks