Colleen P Flanagan, Karen Trang, Joyce Nacario, Peter A Schneider, Warren J Gasper, Michael S Conte, Elizabeth C Wick, Allan M Conway
Title: Large language models can accurately populate Vascular Quality Initiative procedural databases using narrative operative reports.
Journal: Journal of Vascular Surgery (IF 3.9; JCR Q1, Peripheral Vascular Disease)
DOI: 10.1016/j.jvs.2024.12.002 (https://doi.org/10.1016/j.jvs.2024.12.002)
Publication date: 2024-12-16
Publication type: Journal Article
Citations: 0
Abstract
Objective: Participation in the Vascular Quality Initiative (VQI) provides important resources to surgeons, but the ability to do so is often limited by time and data entry personnel. Large language models (LLMs) such as ChatGPT (OpenAI) are examples of generative artificial intelligence products that may help bridge this gap. Trained on large volumes of data, the models are used for natural language processing and text generation. We evaluated the ability of LLMs to accurately populate VQI procedural databases using operative reports.
Methods: A single-center, retrospective study was performed using institutional VQI data from 2021 to 2023. The most recent procedures for carotid endarterectomy (CEA), endovascular aneurysm repair (EVAR), and infrainguinal lower extremity bypass (LEB) were analyzed using Versa, a HIPAA (Health Insurance Portability and Accountability Act)-compliant institutional version of ChatGPT. We created an automated function to analyze operative reports and generate a shareable VQI file using two models: gpt-35-turbo and gpt-4. Application of the LLMs was accomplished with a cloud-based programming interface. The outputs of this model were compared with VQI data for accuracy. We defined a metric as "unavailable" to the LLM if it was discussed by surgeons in <20% of operative reports.
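The comparison step and the "unavailable" rule described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code: the field names, the toy values, and the `mention_counts` input (how many reports discuss each field) are hypothetical, and the abstract does not say how "discussed by surgeons" was determined.

```python
def report_accuracy(extracted, reference, fields):
    """Fraction of the given fields where the LLM value matches the registry value."""
    matches = sum(1 for f in fields if extracted.get(f) == reference.get(f))
    return matches / len(fields)


def unavailable_fields(mention_counts, n_reports, threshold=0.20):
    """Fields discussed in fewer than `threshold` of the operative reports."""
    return {f for f, n in mention_counts.items() if n / n_reports < threshold}


# Hypothetical field names and values, for illustration only.
fields = ["anesthesia", "shunt_used", "patch_type", "drain_placed"]
reference = {"anesthesia": "general", "shunt_used": "yes",
             "patch_type": "bovine", "drain_placed": "yes"}
extracted = {"anesthesia": "general", "shunt_used": "yes",
             "patch_type": "bovine", "drain_placed": None}

# Assumed input: number of reports (out of 50) that mention each field.
mention_counts = {"anesthesia": 50, "shunt_used": 45,
                  "patch_type": 48, "drain_placed": 5}

unavailable = unavailable_fields(mention_counts, n_reports=50)
available = [f for f in fields if f not in unavailable]

acc_all = report_accuracy(extracted, reference, fields)
acc_available = report_accuracy(extracted, reference, available)

print(f"unavailable fields: {sorted(unavailable)}")
print(f"accuracy, all fields: {acc_all:.1%}")
print(f"accuracy, available fields only: {acc_available:.1%}")
```

Scoring each report twice, once over all fields and once over available fields only, separates extraction errors from information that the operative note never contained.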
Results: A total of 150 operative notes were analyzed, including 50 CEA, 50 EVAR, and 50 LEB. These procedural VQI databases included 25, 179, and 51 metrics, respectively. For all fields, gpt-35-turbo had a median accuracy of 84.0% for CEA (interquartile range [IQR]: 80.0%-88.0%), 92.2% for EVAR (IQR: 87.2%-94.0%), and 84.3% for LEB (IQR: 80.2%-88.1%). A total of 3 of 25, 6 of 179, and 7 of 51 VQI variables were unavailable in the operative reports, respectively. Excluding metric information routinely unavailable in operative reports, the median accuracy rate was 95.5% for each CEA procedure (IQR: 90.9%-100.0%), 94.8% for EVAR (IQR: 92.2%-98.5%), and 93.2% for LEB (IQR: 90.2%-96.4%). Across procedures, gpt-4 did not meaningfully improve performance compared with gpt-35 (P = .97, .85, and .95 for CEA, EVAR, and LEB overall performance, respectively). The cost for 150 operative reports analyzed with gpt-35-turbo and gpt-4 was $0.12 and $3.39, respectively.
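The reported totals imply a few derived figures that are easy to check. The sketch below uses only numbers stated in the Results; the per-report cost and the cost ratio are simple divisions, not values reported by the authors.

```python
n_reports = 150
total_cost_usd = {"gpt-35-turbo": 0.12, "gpt-4": 3.39}

# Average cost per operative report for each model.
cost_per_report = {m: c / n_reports for m, c in total_cost_usd.items()}
cost_ratio = total_cost_usd["gpt-4"] / total_cost_usd["gpt-35-turbo"]

# Metrics per procedural database, and those deemed unavailable in the reports.
total_metrics = {"CEA": 25, "EVAR": 179, "LEB": 51}
unavailable_metrics = {"CEA": 3, "EVAR": 6, "LEB": 7}
available_metrics = {p: total_metrics[p] - unavailable_metrics[p]
                     for p in total_metrics}

for model, cost in cost_per_report.items():
    print(f"{model}: ${cost:.4f} per report")
print(f"gpt-4 / gpt-35-turbo cost ratio: {cost_ratio:.2f}")
print(f"available metrics per procedure: {available_metrics}")
```

As an illustration only, a CEA report with 21 correct fields would score 21/25 = 84.0% over all fields and 21/22 = 95.5% once the 3 unavailable fields are excluded. This is one combination consistent with the reported medians, not a claim about any individual report.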
Conclusions: LLMs can accurately populate VQI procedural databases with both structured and unstructured data, while incurring only minor processing costs. Increased workflow efficiency may improve center ability to successfully participate in the VQI. Further work examining other VQI databases and methods to increase accuracy is needed.
About the journal:
Journal of Vascular Surgery ® aims to be the premier international journal of medical, endovascular and surgical care of vascular diseases. It is dedicated to the science and art of vascular surgery and aims to improve the management of patients with vascular diseases by publishing relevant papers that report important medical advances, test new hypotheses, and address current controversies. To achieve this goal, the Journal will publish original clinical and laboratory studies, and reports and papers that comment on the social, economic, ethical, legal, and political factors that relate to these aims. As the official publication of The Society for Vascular Surgery, the Journal will publish, after peer review, selected papers presented at the annual meeting of this organization and affiliated vascular societies, as well as original articles from members and non-members.