Superhuman performance on urology board questions using an explainable language model enhanced with European Association of Urology guidelines

M.J. Hetz, N. Carl, S. Haggenmüller, C. Wies, J.N. Kather, M.S. Michel, F. Wessels, T.J. Brinker

ESMO Real World Data and Digital Oncology, Volume 6, Article 100078. Published 2024-10-04. DOI: 10.1016/j.esmorw.2024.100078
Abstract
Background
Large language models encode clinical knowledge and can answer medical expert questions out-of-the-box, without further training. However, this zero-shot performance is limited by outdated training data and a lack of explainability, both of which impede clinical translation. We aimed to develop a urology-specialized chatbot (UroBot) and evaluate it, in a fully clinician-verifiable manner, against state-of-the-art models as well as the historical performance of urologists on urological board questions.
Materials and methods
We developed UroBot, a software pipeline based on OpenAI's GPT-3.5, GPT-4, and GPT-4o models that uses retrieval-augmented generation (RAG) over the 2023 European Association of Urology (EAU) guidelines. UroBot was benchmarked against the zero-shot performance of GPT-3.5, GPT-4, GPT-4o, and Uro_Chat. The evaluation comprised 10 runs over 200 European Board of Urology in-service assessment questions, with performance measured as the mean rate of correct answers (RoCA). A minimal sketch of how such a RAG pipeline can be wired together is shown below.
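To make the pipeline architecture concrete, the following Python sketch retrieves the guideline passages most similar to a question by embedding similarity and passes them to the chat model as grounding context. This is an illustrative sketch only, not the authors' released code: the chunking, prompt wording, and embedding model choice are assumptions.

```python
# Minimal RAG sketch (illustrative only, not the published UroBot pipeline).
# Assumes the EAU guideline text has been split into chunks beforehand.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with an OpenAI embedding model (model choice is an assumption)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


# Hypothetical corpus: chunks of the 2023 EAU guidelines.
chunks = ["...guideline chunk 1...", "...guideline chunk 2..."]
chunk_vecs = embed(chunks)


def answer(question: str, k: int = 3) -> str:
    """Retrieve the k most similar guideline chunks and ask the model."""
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every chunk.
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-k:])
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer the urology board question using only the "
                           "guideline excerpts below, and cite the passage used.\n\n"
                           + context,
            },
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

Prompting the model to cite the retrieved guideline passage is what makes answers clinician-verifiable in this design: a clinician can check each answer directly against the cited section of the guidelines.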
Results
UroBot-4o achieved the highest RoCA, averaging 88.4% and outperforming GPT-4o (77.6%) by 10.8 percentage points. In addition, its answers are clinician-verifiable, and it demonstrated the highest agreement between runs as measured by Fleiss' kappa (κ = 0.979). By comparison, the average performance of urologists on urological board questions reported in the literature is 68.7%.
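To make the two reported statistics concrete (a sketch under assumed data shapes, not the authors' evaluation code): RoCA is the per-run fraction of correct answers averaged over the 10 runs, and Fleiss' kappa treats the 10 runs as independent raters of each question. Both can be computed with numpy and statsmodels:

```python
# Sketch of the two evaluation metrics (illustrative; the data here are random
# placeholders, so the printed values will not match the paper's results).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical results matrix: answers[r, q] is the option (0-3) chosen by
# run r for question q; correct[q] is the keyed answer.
rng = np.random.default_rng(0)
answers = rng.integers(0, 4, size=(10, 200))   # 10 runs x 200 questions
correct = rng.integers(0, 4, size=200)

# Mean rate of correct answers (RoCA): per-run accuracy, averaged over runs.
roca = (answers == correct).mean(axis=1).mean()

# Fleiss' kappa: agreement between the 10 runs treated as raters.
# aggregate_raters expects a subjects x raters array, hence the transpose.
table, _ = aggregate_raters(answers.T)
kappa = fleiss_kappa(table, method="fleiss")

print(f"RoCA = {roca:.1%}, Fleiss' kappa = {kappa:.3f}")
```

On the paper's actual outputs, this kind of computation yields the reported κ = 0.979, indicating near-perfect consistency across runs.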
Conclusions
UroBot is an accurate, clinician-verifiable software pipeline that outperforms published models and urologists in answering urology board questions. We provide code and instructions for using and extending UroBot in further development.