Superhuman performance on urology board questions using an explainable language model enhanced with European Association of Urology guidelines

M.J. Hetz, N. Carl, S. Haggenmüller, C. Wies, J.N. Kather, M.S. Michel, F. Wessels, T.J. Brinker
ESMO Real World Data and Digital Oncology, Volume 6, Article 100078. Published 2024-10-04. DOI: 10.1016/j.esmorw.2024.100078

Abstract


Background

Large language models encode clinical knowledge and can answer medical expert questions out of the box, without further training. However, this zero-shot performance is limited by outdated training data and a lack of explainability, which impedes clinical translation. We aimed to develop a urology-specialized chatbot (UroBot) and evaluate it, in a fully clinician-verifiable manner, against state-of-the-art models as well as historical urologists’ performance in answering urological board questions.

Materials and methods

We developed UroBot, a software pipeline based on the GPT-3.5, GPT-4, and GPT-4o models by OpenAI, utilizing retrieval-augmented generation (RAG) and the 2023 European Association of Urology guidelines. UroBot was benchmarked against the zero-shot performance of GPT-3.5, GPT-4, GPT-4o, and Uro_Chat. The evaluation involved 10 runs over 200 European Board of Urology in-service assessment questions, with performance measured by the mean rate of correct answers (RoCA).
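The retrieval-augmented generation step described above can be sketched as follows. This is a minimal, self-contained illustration, not the UroBot implementation: a toy token-overlap scorer stands in for a real embedding-based retriever, and the guideline snippets, question, and helper names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set used by the toy relevance scorer."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k guideline chunks with the largest token overlap
    with the question (a stand-in for semantic retrieval)."""
    q = tokenize(question)
    scored = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return scored[:top_k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Prepend retrieved guideline text so the model's answer is grounded
    in, and verifiable against, the numbered excerpts."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (f"Guideline excerpts:\n{context}\n\n"
            f"Question: {question}\nAnswer, citing the excerpts:")

# Illustrative guideline chunks (not actual EAU guideline text)
guideline_chunks = [
    "Radical cystectomy is recommended for muscle-invasive bladder cancer.",
    "PSA testing should be discussed with men at elevated risk of prostate cancer.",
    "Ureteroscopy is an option for renal stones smaller than 2 cm.",
]
question = "What is recommended for muscle-invasive bladder cancer?"
prompt = build_prompt(question, retrieve(question, guideline_chunks))
print(prompt)
```

Because the prompt carries the retrieved excerpts verbatim, a clinician can check the model's answer against the cited guideline passages, which is what makes this design clinician-verifiable.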

Results

UroBot-4o achieved the highest RoCA, with an average of 88.4%, outperforming GPT-4o (77.6%) by 10.8 percentage points. In addition, UroBot-4o is clinician-verifiable and demonstrated the highest level of agreement between runs as measured by Fleiss’ kappa (κ = 0.979). In comparison, the average performance of urologists on urological board questions is 68.7%, as reported in the literature.
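The two statistics reported above can be computed as sketched below: the mean rate of correct answers (RoCA) averaged over runs, and Fleiss’ kappa for between-run agreement, treating runs as raters and answer options as categories. The tiny matrices are illustrative toy data, not study results.

```python
def mean_roca(runs: list[list[bool]]) -> float:
    """Average fraction of correct answers over all runs."""
    return sum(sum(r) / len(r) for r in runs) / len(runs)

def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for an items x categories count matrix, where each
    row sums to the (constant) number of raters n per item."""
    n = sum(counts[0])   # raters (runs) per item
    N = len(counts)      # number of items (questions)
    k = len(counts[0])   # number of categories (answer options)
    # mean per-item agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # marginal category proportions p_j and chance agreement P_e
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# 2 runs answering 4 questions: correctness per question
runs = [[True, True, False, True], [True, True, True, True]]
print(round(mean_roca(runs), 3))        # mean of 0.75 and 1.0

# 3 runs choosing between 2 options on 4 questions (rows sum to 3)
counts = [[3, 0], [0, 3], [3, 0], [2, 1]]
print(round(fleiss_kappa(counts), 3))   # high but imperfect agreement
```

A kappa near 1, as reported for UroBot-4o (κ = 0.979), means the pipeline returns nearly the same answers on every run, which matters for clinical reproducibility.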

Conclusions

UroBot is a clinician-verifiable and accurate software pipeline and outperforms published models and urologists in answering urology board questions. We provide code and instructions to use and extend UroBot for further development.