Advait Patil, Paul Serrato, Nathan Chisvo, Omar Arnaout, Pokmeng Alfred See, Kevin T. Huang
Acta Neurochirurgica, volume 166 (1). Published 2024-11-23. DOI: 10.1007/s00701-024-06372-9. https://link.springer.com/article/10.1007/s00701-024-06372-9
Large language models in neurosurgery: a systematic review and meta-analysis
Background
Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature.
Methods
We searched PubMed and Google Scholar using terms related to LLMs and neurosurgery (“large language model” OR “LLM” OR “ChatGPT” OR “GPT-3” OR “GPT3” OR “GPT-3.5” OR “GPT3.5” OR “GPT-4” OR “GPT4” OR “LLAMA” OR “MISTRAL” OR “BARD”) AND “neurosurgery”. The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures.
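The boolean search string described above can be assembled programmatically, which makes it easier to rerun or extend. The following is only a minimal sketch: the term list is copied from the Methods text, while the function name `build_query` and the variable names are illustrative assumptions, and the exact syntax accepted by PubMed or Google Scholar may differ from this plain string.

```python
# Terms copied from the Methods paragraph; names below are illustrative only.
LLM_TERMS = [
    "large language model", "LLM", "ChatGPT",
    "GPT-3", "GPT3", "GPT-3.5", "GPT3.5",
    "GPT-4", "GPT4", "LLAMA", "MISTRAL", "BARD",
]
DOMAIN_TERM = "neurosurgery"


def build_query(llm_terms, domain_term):
    """Join the LLM terms with OR, then AND the group with the domain term."""
    quoted = " OR ".join(f'"{term}"' for term in llm_terms)
    return f'({quoted}) AND "{domain_term}"'


query = build_query(LLM_TERMS, DOMAIN_TERM)
print(query)
```

Keeping the terms in a list rather than a hand-typed string means a replication attempt can add newer model names without risking unbalanced quotes or parentheses.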
Results
Fifty-one articles met the inclusion criteria and were categorized into six application areas, the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning.
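The reported percentages all follow from dividing each count by the 51 included articles. A short check, assuming rounding to one decimal place (the dictionary keys are shorthand labels, not the authors' own variable names):

```python
TOTAL_ARTICLES = 51  # articles meeting inclusion criteria

# Counts as reported in the Results paragraph.
reported = {
    "Generation of Text for Direct Clinical Use": (14, 27.5),
    "Answering Standardized Exam Questions": (12, 23.5),
    "Clinical Judgement and Decision-Making Support": (11, 21.6),
    "GPT-3.5": (30, 58.8),
    "GPT-4": (20, 39.2),
    "Bard": (9, 17.6),
    "Bing": (6, 11.8),
    "Out-of-the-box use": (43, 84.3),
    "Pre-training or fine-tuning": (8, 15.7),
}


def percentage(count, total=TOTAL_ARTICLES):
    """Share of the included articles, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, (count, stated) in reported.items():
    computed = percentage(count)
    status = "ok" if computed == stated else "MISMATCH"
    print(f"{label}: {count}/{TOTAL_ARTICLES} = {computed}% (stated {stated}%) {status}")
```

Two details are visible from the counts alone: the out-of-the-box and fine-tuning groups partition the full set (43 + 8 = 51), whereas the four model counts sum to 65, which presumably reflects studies that evaluated more than one model.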
Conclusions
Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. However, research typically addresses basic applications, overlooks methods for enhancing LLM performance, and faces reproducibility issues. Standardizing detailed reporting, accounting for LLM stochasticity, and using advanced methods beyond basic validation are essential for progress.
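One way to account for LLM stochasticity is to submit the same prompt several times and report how often the most common answer recurs. The sketch below illustrates that idea only; it is not the authors' protocol. The function `fake_llm` is a hypothetical stand-in that samples from fixed weights, and a real study would replace it with calls to the model under evaluation.

```python
import random
from collections import Counter


def fake_llm(prompt, rng):
    """Hypothetical stand-in for a model call: returns a randomly sampled answer."""
    return rng.choices(["A", "B", "C"], weights=[0.7, 0.2, 0.1])[0]


def answer_consistency(prompt, n_runs, seed=0):
    """Ask the same prompt n_runs times; return the modal answer, its share, and all counts."""
    rng = random.Random(seed)  # seeded so the sketch itself is repeatable
    answers = [fake_llm(prompt, rng) for _ in range(n_runs)]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return modal_answer, modal_count / n_runs, counts


modal, share, counts = answer_consistency("Example exam question", n_runs=20)
print(f"modal answer: {modal}, agreement: {share:.0%}, distribution: {dict(counts)}")
```

Reporting the agreement share alongside accuracy, together with the exact prompt and the number of runs, gives readers what they need to judge whether a single reported answer was typical.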
About the journal:
The journal "Acta Neurochirurgica" publishes only original papers useful to both research and clinical work. Papers should deal with clinical neurosurgery (diagnosis and diagnostic techniques, operative surgery and results, postoperative treatment) or with research work in neuroscience if the underlying questions or the results are of neurosurgical interest. Congresses are reported in brief accounts. As the official organ of the European Association of Neurosurgical Societies, the journal publishes all announcements of the E.A.N.S. and reports on the activities of its member societies. Only contributions written in English will be accepted.