Jeanene Johnson, MPH, BSN (Quality Improvement Advisor, Quality Improvement Department, Stanford Medicine Children's Health, Palo Alto, California); Conner Brown, BS (Data Scientist, Stanford Medicine Children's Health); Grace Lee, MD, MPH (Professor, Department of Pediatrics, Stanford University School of Medicine, and Chief Quality Officer, Stanford Medicine Children's Health); Keith Morse, MD, MBA (Clinical Associate Professor, Department of Pediatrics, Stanford University School of Medicine, and Medical Director of Clinical Informatics, Stanford Medicine Children's Health)
{"title":"专有大语言模型在标记产科事故报告中的准确性。","authors":"Jeanene Johnson MPH, BSN (is Quality Improvement Advisor, Quality Improvement Department, Stanford Medicine Children's Health, Palo Alto, California.), Conner Brown BS (is Data Scientist, Stanford Medicine Children's Health.), Grace Lee MD, MPH (is Professor, Department of Pediatrics, Stanford University School of Medicine, and Chief Quality Officer, Stanford Medicine Children's Health.), Keith Morse MD, MBA (is Clinical Associate Professor, Department of Pediatrics, Stanford University School of Medicine, and Medical Director of Clinical Informatics, Stanford Medicine Children's Health)","doi":"10.1016/j.jcjq.2024.08.001","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><div>Using the data collected through incident reporting systems is challenging, as it is a large volume of primarily qualitative information. Large language models (LLMs), such as ChatGPT, provide novel capabilities in text summarization and labeling that could support safety data trending and early identification of opportunities to prevent patient harm. This study assessed the capability of a proprietary LLM (GPT-3.5) to automatically label a cross-sectional sample of real-world obstetric incident reports.</div></div><div><h3>Methods</h3><div>A sample of 370 incident reports submitted to inpatient obstetric units between December 2022 and May 2023 was extracted. Human-annotated labels were assigned by a clinician reviewer and considered gold standard. The LLM was prompted to label incident reports relying solely on its pretrained knowledge and information included in the prompt. Primary outcomes assessed were sensitivity, specificity, positive predictive value, and negative predictive value. A secondary outcome assessed the human-perceived quality of the model's justification for the label(s) applied.</div></div><div><h3>Results</h3><div>The LLM demonstrated the ability to label incident reports with high sensitivity and specificity. The model applied a total of 79 labels compared to the reviewer's 49 labels. Overall sensitivity for the model was 85.7%, and specificity was 97.9%. Positive and negative predictive values were 53.2% and 99.6%, respectively. For 60.8% of labels, the reviewer approved of the model's justification for applying the label.</div></div><div><h3>Conclusion</h3><div>The proprietary LLM demonstrated the ability to label obstetric incident reports with high sensitivity and specificity. 
LLMs offer the potential to enable more efficient use of data from incident reporting systems.</div></div>","PeriodicalId":14835,"journal":{"name":"Joint Commission journal on quality and patient safety","volume":"50 12","pages":"Pages 877-881"},"PeriodicalIF":2.3000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Accuracy of a Proprietary Large Language Model in Labeling Obstetric Incident Reports\",\"authors\":\"Jeanene Johnson MPH, BSN (is Quality Improvement Advisor, Quality Improvement Department, Stanford Medicine Children's Health, Palo Alto, California.), Conner Brown BS (is Data Scientist, Stanford Medicine Children's Health.), Grace Lee MD, MPH (is Professor, Department of Pediatrics, Stanford University School of Medicine, and Chief Quality Officer, Stanford Medicine Children's Health.), Keith Morse MD, MBA (is Clinical Associate Professor, Department of Pediatrics, Stanford University School of Medicine, and Medical Director of Clinical Informatics, Stanford Medicine Children's Health)\",\"doi\":\"10.1016/j.jcjq.2024.08.001\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Background</h3><div>Using the data collected through incident reporting systems is challenging, as it is a large volume of primarily qualitative information. Large language models (LLMs), such as ChatGPT, provide novel capabilities in text summarization and labeling that could support safety data trending and early identification of opportunities to prevent patient harm. This study assessed the capability of a proprietary LLM (GPT-3.5) to automatically label a cross-sectional sample of real-world obstetric incident reports.</div></div><div><h3>Methods</h3><div>A sample of 370 incident reports submitted to inpatient obstetric units between December 2022 and May 2023 was extracted. Human-annotated labels were assigned by a clinician reviewer and considered gold standard. The LLM was prompted to label incident reports relying solely on its pretrained knowledge and information included in the prompt. Primary outcomes assessed were sensitivity, specificity, positive predictive value, and negative predictive value. A secondary outcome assessed the human-perceived quality of the model's justification for the label(s) applied.</div></div><div><h3>Results</h3><div>The LLM demonstrated the ability to label incident reports with high sensitivity and specificity. The model applied a total of 79 labels compared to the reviewer's 49 labels. Overall sensitivity for the model was 85.7%, and specificity was 97.9%. Positive and negative predictive values were 53.2% and 99.6%, respectively. For 60.8% of labels, the reviewer approved of the model's justification for applying the label.</div></div><div><h3>Conclusion</h3><div>The proprietary LLM demonstrated the ability to label obstetric incident reports with high sensitivity and specificity. 
LLMs offer the potential to enable more efficient use of data from incident reporting systems.</div></div>\",\"PeriodicalId\":14835,\"journal\":{\"name\":\"Joint Commission journal on quality and patient safety\",\"volume\":\"50 12\",\"pages\":\"Pages 877-881\"},\"PeriodicalIF\":2.3000,\"publicationDate\":\"2024-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Joint Commission journal on quality and patient safety\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1553725024002332\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Joint Commission journal on quality and patient safety","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1553725024002332","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Accuracy of a Proprietary Large Language Model in Labeling Obstetric Incident Reports
Background
Using the data collected through incident reporting systems is challenging, as it comprises a large volume of primarily qualitative information. Large language models (LLMs), such as ChatGPT, provide novel capabilities in text summarization and labeling that could support safety data trending and early identification of opportunities to prevent patient harm. This study assessed the capability of a proprietary LLM (GPT-3.5) to automatically label a cross-sectional sample of real-world obstetric incident reports.
Methods
A sample of 370 incident reports submitted to inpatient obstetric units between December 2022 and May 2023 was extracted. Human-annotated labels were assigned by a clinician reviewer and considered the gold standard. The LLM was prompted to label incident reports relying solely on its pretrained knowledge and the information included in the prompt. Primary outcomes assessed were sensitivity, specificity, positive predictive value, and negative predictive value. A secondary outcome assessed the human-perceived quality of the model's justification for the label(s) applied.
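The abstract does not publish the study's prompt or label taxonomy. As a rough illustration of the zero-shot labeling approach described, the sketch below uses the openai Python client; the model name, label set, prompt wording, and sample report text are all illustrative assumptions, not the study's materials.

# Minimal sketch of zero-shot incident report labeling with a proprietary LLM.
# The label taxonomy, prompt, and sample report are hypothetical; the study's
# actual prompt and label set are not given in the abstract.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical obstetric safety label set (not the study's taxonomy).
LABELS = ["medication error", "delay in care", "communication failure", "equipment issue"]

def label_report(report_text: str) -> str:
    """Ask the model to apply zero or more labels, each with a brief justification."""
    prompt = (
        "You are labeling a hospital obstetric incident report.\n"
        f"Apply any of these labels that fit: {', '.join(LABELS)}.\n"
        "For each label applied, give a one-sentence justification.\n"
        "If no label fits, answer 'none'.\n\n"
        f"Incident report:\n{report_text}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize sampling variability across reports
    )
    return response.choices[0].message.content

# Synthetic example report, for illustration only.
print(label_report("Oxytocin infusion was paused for 40 minutes after an unresolved pump alarm."))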
Results
The LLM demonstrated the ability to label incident reports with high sensitivity and specificity. The model applied a total of 79 labels compared to the reviewer's 49 labels. Overall sensitivity for the model was 85.7%, and specificity was 97.9%. Positive and negative predictive values were 53.2% and 99.6%, respectively. For 60.8% of labels, the reviewer approved of the model's justification for applying the label.
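The abstract reports the four metrics but not the underlying per-label confusion matrix. The sketch below shows how the metrics are defined; the counts used are one hypothetical matrix that is consistent with the reported figures (49 reviewer labels, 79 model labels), not published values.

# Consistency check on the reported metrics. The counts below are assumed,
# chosen to reproduce the reported percentages; they are not published data.
tp, fp, fn, tn = 42, 37, 7, 1725  # hypothetical true/false positives and negatives

sensitivity = tp / (tp + fn)  # 42/49      -> 85.7%
specificity = tn / (tn + fp)  # 1725/1762  -> 97.9%
ppv = tp / (tp + fp)          # 42/79      -> 53.2%
npv = tn / (tn + fn)          # 1725/1732  -> 99.6%

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("PPV", ppv), ("NPV", npv)]:
    print(f"{name}: {value:.1%}")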
Conclusion
The proprietary LLM demonstrated the ability to label obstetric incident reports with high sensitivity and specificity. LLMs could enable more efficient use of data from incident reporting systems.