Veysel Kocaman, Fu-Yuan Cheng, Julio Bonis, Ganesh Raut, Prem Timsina, David Talby, Arash Kia
Background: Clinical notes house rich but unstructured patient data, and the ambiguity introduced by medical jargon, abbreviations, and synonyms makes them challenging to analyze. This complicates real-time extraction for decision support tools.
Objective: This study aimed to examine the data curation, technology, and workflow of the named entity recognition (NER) pipeline, a component of a broader clinical decision support tool that identifies key entities using NER models and classifies these entities as present or absent in the patient through an NER assertion model.
Methods: We gathered progress care, radiology, and pathology notes from 5000 patients, dividing them into 5 batches of 1000 patients each. Metrics such as notes and reports per patient, sentence count, token size, runtime, and central processing unit (CPU) and memory use were measured per note type. We also evaluated the precision of the NER outputs and then the precision and recall of the NER assertion models against manual annotations by a clinical expert.
Results: Using Spark natural language processing (NLP) clinical pretrained NER models on 138,250 clinical notes, we observed excellent NER precision, peaking at 0.989 (95% CI 0.977-1.000) for procedures, and an assertion-model accuracy of 0.889 (95% CI 0.856-0.922). Our analysis highlighted long-tail distributions in notes per patient, note length, and entity density. Progress care notes had notably more entities per sentence than radiology and pathology notes, showing 4-fold and 16-fold differences, respectively.
Conclusions: Further research should explore the analysis of clinical notes beyond the scope of our study, including discharge summaries and psychiatric evaluation notes. Recognizing the unique linguistic characteristics of different note types underscores the importance of developing specialized NER models or natural language processing pipeline setups tailored to each type. By doing so, we can enhance their performance across a more diverse range of clinical scenarios.
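To make the pipeline structure concrete, the sketch below assembles a Spark NLP document-to-entities pipeline with open-source pretrained models. The "glove_100d" and "ner_dl" models here are public stand-ins; the study used licensed clinical NER and assertion models from Spark NLP for Healthcare, which follow the same stage pattern with an assertion stage appended after the entity converter.

```python
# Minimal sketch, assuming the open-source Spark NLP distribution
# (pip install spark-nlp pyspark). The study's licensed clinical NER and
# assertion models plug into the same stage sequence.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
)
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = (WordEmbeddingsModel.pretrained("glove_100d")
              .setInputCols(["sentence", "token"]).setOutputCol("embeddings"))
ner = (NerDLModel.pretrained("ner_dl")
       .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner"))
chunks = (NerConverter()
          .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk"))

pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, chunks])
notes = spark.createDataFrame(
    [["Patient denies chest pain; an EKG was performed."]], ["text"]
)
result = pipeline.fit(notes).transform(notes)
result.selectExpr("explode(ner_chunk.result) AS entity").show(truncate=False)
```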
{"title":"Exploring Named Entity Recognition Potential and the Value of Tailored Natural Language Processing Pipelines for Radiology, Pathology, and Progress Notes in Clinical Decision Support: Quantitative Study.","authors":"Veysel Kocaman, Fu-Yuan Cheng, Julio Bonis, Ganesh Raut, Prem Timsina, David Talby, Arash Kia","doi":"10.2196/59251","DOIUrl":"10.2196/59251","url":null,"abstract":"<p><strong>Background: </strong>Clinical notes house rich, yet unstructured, patient data, making analysis challenging due to medical jargon, abbreviations, and synonyms causing ambiguity. This complicates real-time extraction for decision support tools.</p><p><strong>Objective: </strong>This study aimed to examine the data curation, technology, and workflow of the named entity recognition (NER) pipeline, a component of a broader clinical decision support tool that identifies key entities using NER models and classifies these entities as present or absent in the patient through an NER assertion model.</p><p><strong>Methods: </strong>We gathered progress care, radiology, and pathology notes from 5000 patients, dividing them into 5 batches of 1000 patients each. Metrics such as notes and reports per patient, sentence count, token size, runtime, central processing unit, and memory use were measured per note type. We also evaluated the precision of the NER outputs and then the precision and recall of NER assertion models against manual annotations by a clinical expert.</p><p><strong>Results: </strong>Using Spark natural language processing clinical pretrained NER models on 138,250 clinical notes, we observed excellent NER precision, with a peak in procedures at 0.989 (95% CI 0.977-1.000) and an accuracy in the assertion model of 0.889 (95% CI 0.856-0.922). Our analysis highlighted long-tail distributions in notes per patient, note length, and entity density. Progress care notes had notably more entities per sentence than radiology and pathology notes, showing 4-fold and 16-fold differences, respectively.</p><p><strong>Conclusions: </strong>Further research should explore the analysis of clinical notes beyond the scope of our study, including discharge summaries and psychiatric evaluation notes. Recognizing the unique linguistic characteristics of different note types underscores the importance of developing specialized NER models or natural language processing pipeline setups tailored to each type. By doing so, we can enhance their performance across a more diverse range of clinical scenarios.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e59251"},"PeriodicalIF":2.0,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449662/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145006989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laura Antonia Meliante, Giulia Coco, Alessandro Rabiolo, Stefano De Cillà, Gianluca Manni
Background: Artificial intelligence (AI) is becoming increasingly popular in the scientific field, as it can analyze extensive datasets, summarize results, and assist in writing academic papers.
Objective: This study investigates the role of AI in the process of conducting a systematic literature review (SLR), focusing on its contributions and limitations at three key stages of SLR development (study selection, data extraction, and study composition), using glaucoma-related SLRs as case studies and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-based SLRs as benchmarks.
Methods: Four AI platforms were tested on their ability to reproduce four PRISMA-based, glaucoma-related SLRs. We used Connected Papers and Elicit to search for relevant records; we then assessed the ability of Elicit and ChatPDF to extract and organize information contained in the retrieved records. Finally, we tested Jenni AI's capacity to compose an SLR.
Results: Neither Connected Papers nor Elicit provided the totality of the results found using the PRISMA method. On average, data extracted from Elicit were accurate in 51.40% (SD 31.45%) of cases and imprecise in 13.69% (SD 17.98%); 22.37% (SD 27.54%) of responses were missing, while 12.51% (SD 14.70%) were incorrect. Data extracted from ChatPDF were accurate in 60.33% (SD 30.72%) of cases and imprecise in 7.41% (SD 13.88%); 17.56% (SD 20.02%) of responses were missing, and 14.70% (SD 17.72%) were incorrect. Jenni AI's generated content exhibited satisfactory language fluency and technical proficiency but was insufficient in defining methods, elaborating results, and stating conclusions.
Conclusions: The PRISMA method continues to exhibit clear superiority in terms of reproducibility and accuracy during the literature search, data extraction, and study composition phases of the SLR writing process. While AI can save time and assist with repetitive tasks, the active participation of the researcher throughout the entire process is still crucial to maintain control over the quality, accuracy, and objectivity of their work.
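For readers reproducing this kind of content analysis, the sketch below shows one way per-review outcome counts could be rolled up into the mean (SD) percentages reported above; the counts themselves are invented placeholders, not the study's data.

```python
# Minimal sketch with hypothetical per-review extraction outcome counts.
import statistics

# One dict per benchmark SLR: tallies of extraction outcomes for one tool.
reviews = [
    {"accurate": 42, "imprecise": 6, "missing": 20, "incorrect": 12},
    {"accurate": 55, "imprecise": 10, "missing": 8, "incorrect": 7},
    {"accurate": 30, "imprecise": 12, "missing": 25, "incorrect": 13},
    {"accurate": 70, "imprecise": 4, "missing": 16, "incorrect": 10},
]

for outcome in ("accurate", "imprecise", "missing", "incorrect"):
    # Convert each review's count to a percentage, then average across reviews.
    pcts = [100 * r[outcome] / sum(r.values()) for r in reviews]
    print(f"{outcome}: mean {statistics.mean(pcts):.2f}% "
          f"(SD {statistics.stdev(pcts):.2f}%)")
```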
{"title":"Evaluation of AI Tools Versus the PRISMA Method for Literature Search, Data Extraction, and Study Composition in Glaucoma Systematic Reviews: Content Analysis.","authors":"Laura Antonia Meliante, Giulia Coco, Alessandro Rabiolo, Stefano De Cillà, Gianluca Manni","doi":"10.2196/68592","DOIUrl":"10.2196/68592","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) is becoming increasingly popular in the scientific field, as it allows for the analysis of extensive datasets, summarizes results, and assists in writing academic papers.</p><p><strong>Objective: </strong>This study investigates the role of AI in the process of conducting a systematic literature review (SLR), focusing on its contributions and limitations at three key stages of its development, study selection, data extraction, and study composition, using glaucoma-related SLRs as case studies and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-based SLRs as benchmarks.</p><p><strong>Methods: </strong>Four AI platforms were tested on their ability to reproduce four PRISMA-based, glaucoma-related SLRs. We used Connected Papers and Elicit to perform research of relevant records; then we assessed Elicit and ChatPDF's ability to extract and organize information contained in the retrieved records. Finally, we tested Jenni AI's capacity to compose an SLR.</p><p><strong>Results: </strong>Neither Connected Papers nor Elicit provided the totality of the results found using the PRISMA method. On average, data extracted from Elicit were accurate in 51.40% (SD 31.45%) of cases and imprecise in 13.69% (SD 17.98%); 22.37% (SD 27.54%) of responses were missing, while 12.51% (SD 14.70%) were incorrect. Data extracted from ChatPDF were accurate in 60.33% (SD 30.72%) of cases and imprecise in 7.41% (SD 13.88%); 17.56% (SD 20.02%) of responses were missing, and 14.70% (SD 17.72%) were incorrect. Jenni AI's generated content exhibited satisfactory language fluency and technical proficiency but was insufficient in defining methods, elaborating results, and stating conclusions.</p><p><strong>Conclusions: </strong>The PRISMA method continues to exhibit clear superiority in terms of reproducibility and accuracy during the literature search, data extraction, and study composition phases of the SLR writing process. While AI can save time and assist with repetitive tasks, the active participation of the researcher throughout the entire process is still crucial to maintain control over the quality, accuracy, and objectivity of their work.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68592"},"PeriodicalIF":2.0,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12413140/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145006987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingjia Huo, Sean Perez, Linda Awdishu, Janice S Kerr, Pengtao Xie, Adnan Khan, Kristin Mekeel, Shamim Nemati
Background: Tacrolimus forms the backbone of immunosuppressive therapy in solid organ transplantation, requiring precise dosing due to its narrow therapeutic range. Maintaining therapeutic tacrolimus levels in the postoperative period is challenging due to diverse patient characteristics, donor organ factors, drug interactions, and evolving perioperative physiology.
Objective: The aim of this study is to design a machine learning model to predict the next-day tacrolimus trough concentrations (C0) and guide dosing to prevent persistent under- or overdosing.
Methods: We used retrospective data from 1597 adult recipients of kidney and liver transplants at UC San Diego Health to develop a long short-term memory (LSTM) model to predict next-day tacrolimus C0 in an inpatient setting. Predictors included transplant type, demographics, comorbidities, vital signs, laboratory parameters, ordered diet, and medications. Permutation feature importance was evaluated for the model. We further implemented a classification task to evaluate the model's ability to identify underdosing, therapeutic dosing, and overdosing. Finally, we generated next-day dose recommendations that would achieve tacrolimus C0 within the target ranges.
Results: The LSTM model provided a mean absolute error of 1.880 ng/mL when predicting next-day tacrolimus C0. Top predictive features included the recent tacrolimus C0, tacrolimus doses, transplant organ type, diet, and interactive drugs. When predicting underdosing, therapeutic dosing, and overdosing using a 3-class classification task, the model achieved a microaverage F1-score of 0.653. For dose recommendations, the best clinical outcomes were achieved when the actual total daily dose closely aligned with the model's recommended dose (within 3 mg).
Conclusions: Ours is one of the largest studies to apply artificial intelligence to tacrolimus dosing, and our LSTM model effectively predicts tacrolimus C0 and could potentially guide accurate dose recommendations. Further prospective studies are needed to evaluate the model's performance in real-world dose adjustments.
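As a rough illustration of the modeling setup, here is a minimal PyTorch sketch of an LSTM regressor that maps a patient's daily feature sequence to a next-day trough prediction and is trained against mean absolute error, the metric reported above. Feature count, sequence length, and layer sizes are illustrative assumptions, not the authors' architecture.

```python
# Minimal LSTM-regressor sketch; all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class TacrolimusLSTM(nn.Module):
    def __init__(self, n_features: int = 32, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # predicted next-day C0 (ng/mL)

    def forward(self, x):  # x: (batch, days, n_features)
        out, _ = self.lstm(x)
        # Use the hidden state of the most recent inpatient day.
        return self.head(out[:, -1, :]).squeeze(-1)

model = TacrolimusLSTM()
x = torch.randn(8, 5, 32)        # 8 patients, 5 inpatient days, 32 features
y = torch.rand(8) * 15           # observed troughs (ng/mL), synthetic
loss = nn.L1Loss()(model(x), y)  # MAE, matching the reported error metric
loss.backward()
print(f"batch MAE: {loss.item():.3f} ng/mL")
```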
{"title":"AI-Driven Tacrolimus Dosing in Transplant Care: Cohort Study.","authors":"Mingjia Huo, Sean Perez, Linda Awdishu, Janice S Kerr, Pengtao Xie, Adnan Khan, Kristin Mekeel, Shamim Nemati","doi":"10.2196/67302","DOIUrl":"10.2196/67302","url":null,"abstract":"<p><strong>Background: </strong>Tacrolimus forms the backbone of immunosuppressive therapy in solid organ transplantation, requiring precise dosing due to its narrow therapeutic range. Maintaining therapeutic tacrolimus levels in the postoperative period is challenging due to diverse patient characteristics, donor organ factors, drug interactions, and evolving perioperative physiology.</p><p><strong>Objective: </strong>The aim of this study is to design a machine learning model to predict the next-day tacrolimus trough concentrations (C0) and guide dosing to prevent persistent under- or overdosing.</p><p><strong>Methods: </strong>We used retrospective data from 1597 adult recipients of kidney and liver transplants at UC San Diego Health to develop a long short-term memory (LSTM) model to predict next-day tacrolimus C0 in an inpatient setting. Predictors included transplant type, demographics, comorbidities, vital signs, laboratory parameters, ordered diet, and medications. Permutation feature importance was evaluated for the model. We further implemented a classification task to evaluate the model's ability to identify underdosing, therapeutic dosing, and overdosing. Finally, we generated next-day dose recommendations that would achieve tacrolimus C0 within the target ranges.</p><p><strong>Results: </strong>The LSTM model provided a mean absolute error of 1.880 ng/mL when predicting next-day tacrolimus C0. Top predictive features included the recent tacrolimus C0, tacrolimus doses, transplant organ type, diet, and interactive drugs. When predicting underdosing, therapeutic dosing, and overdosing using a 3-class classification task, the model achieved a microaverage F1-score of 0.653. For dose recommendations, the best clinical outcomes were achieved when the actual total daily dose closely aligned with the model's recommended dose (within 3 mg).</p><p><strong>Conclusions: </strong>Ours is one of the largest studies to apply artificial intelligence to tacrolimus dosing, and our LSTM model effectively predicts tacrolimus C0 and could potentially guide accurate dose recommendations. Further prospective studies are needed to evaluate the model's performance in real-world dose adjustments.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e67302"},"PeriodicalIF":2.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404564/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anne Pankow, Nico Meißner-Bendzko, Jessica Kaufeld, Laura Fouquette, Fabienne Cotte, Stephen Gilbert, Ewelina Türk, Anibh Das, Christoph Terkamp, Gerhard-Rüdiger Burmester, Annette Doris Wagner
Background: Rare diseases, which affect millions of people worldwide, pose a major challenge, as it often takes years before an accurate diagnosis can be made. This delay results in substantial burdens for patients and health care systems, as misdiagnoses lead to inadequate treatment and increased costs. Artificial intelligence (AI)-powered symptom checkers (SCs) present an opportunity to flag rare diseases earlier in the diagnostic work-up. However, these tools are primarily based on published literature, which often contains incomplete data on rare diseases, resulting in compromised diagnostic accuracy. Integrating expert interview insights into SC models may enhance their performance, ensuring that rare diseases are considered sooner and diagnosed more accurately.
Objective: The objectives of our study were to incorporate expert interview vignettes into AI-powered SCs, in addition to a traditional literature review, and to evaluate whether this novel approach improves diagnostic accuracy and user satisfaction for rare diseases, focusing on Fabry disease.
Methods: This mixed methods prospective pilot study was conducted at Hannover Medical School, Germany. In the first phase, guided interviews were conducted with medical experts specialized in Fabry disease to create clinical vignettes that enriched the AI SC's Fabry disease model. In the second phase, adult patients with a confirmed diagnosis of Fabry disease used both the original and optimized SC versions in a randomized order. The versions, containing either the original or the optimized Fabry disease model, were evaluated on diagnostic accuracy and user satisfaction, assessed through questionnaires.
Results: Three medical experts with extensive experience in the lysosomal storage disorder Fabry disease contributed to the creation of 5 clinical vignettes, which were integrated into the AI-powered SC. The study compared the original and optimized SC versions in 6 patients with Fabry disease. The optimized version improved diagnostic accuracy, with Fabry disease identified as the top suggestion in 33% (2/6) of cases, compared to 17% (1/6) with the original model. Additionally, overall user satisfaction was higher for the optimized version, with participants rating it more favorably in terms of symptom coverage and completeness.
Conclusions: This study demonstrates that integrating expert-derived clinical vignettes into AI-powered SCs can improve diagnostic accuracy and user satisfaction, particularly for rare diseases. The optimized SC version, which incorporated these vignettes, showed improved performance in identifying Fabry disease as a top diagnostic suggestion and received higher user satisfaction ratings compared to the original version. To fully realize the potential of this approach, it is crucial to include vignettes representing atypical presentations and to …
{"title":"Medical Expert Knowledge Meets AI to Enhance Symptom Checker Performance for Rare Disease Identification in Fabry Disease: Mixed Methods Study.","authors":"Anne Pankow, Nico Meißner-Bendzko, Jessica Kaufeld, Laura Fouquette, Fabienne Cotte, Stephen Gilbert, Ewelina Türk, Anibh Das, Christoph Terkamp, Gerhard-Rüdiger Burmester, Annette Doris Wagner","doi":"10.2196/55001","DOIUrl":"10.2196/55001","url":null,"abstract":"<p><strong>Background: </strong>Rare diseases, which affect millions of people worldwide, pose a major challenge, as it often takes years before an accurate diagnosis can be made. This delay results in substantial burdens for patients and health care systems, as misdiagnoses lead to inadequate treatment and increased costs. Artificial intelligence (AI)-powered symptom checkers (SCs) present an opportunity to flag rare diseases earlier in the diagnostic work-up. However, these tools are primarily based on published literature, which often contains incomplete data on rare diseases, resulting in compromised diagnostic accuracy. Integrating expert interview insights into SC models may enhance their performance, ensuring that rare diseases are considered sooner and diagnosed more accurately.</p><p><strong>Objective: </strong>The objectives of our study were to incorporate expert interview vignettes into AI-powered SCs, in addition to a traditional literature review, and to evaluate whether this novel approach improves diagnostic accuracy and user satisfaction for rare diseases, focusing on Fabry disease.</p><p><strong>Methods: </strong>This mixed methods prospective pilot study was conducted at Hannover Medical School, Germany. In the first phase, guided interviews were conducted with medical experts specialized in Fabry disease to create clinical vignettes that enriched the AI SC's Fabry disease model. In the second phase, adult patients with a confirmed diagnosis of Fabry disease used both the original and optimized SC versions in a randomized order. The versions, containing either the original or the optimized Fabry disease model, were evaluated based on diagnostic accuracy and user satisfaction, which were assessed through questionnaires.</p><p><strong>Results: </strong>Three medical experts with extensive experience in lysosomal storage disorder Fabry disease contributed to the creation of 5 clinical vignettes, which were integrated into the AI-powered SC. The study compared the original and optimized SC versions in 6 patients with Fabry disease. The optimized version improved diagnostic accuracy, with Fabry disease identified as the top suggestion in 33% (2/6) of cases, compared to 17% (1/6) with the original model. Additionally, overall user satisfaction was higher for the optimized version, with participants rating it more favorably in terms of symptom coverage and completeness.</p><p><strong>Conclusions: </strong>This study demonstrates that integrating expert-derived clinical vignettes into AI-powered SCs can improve diagnostic accuracy and user satisfaction, particularly for rare diseases. The optimized SC version, which incorporated these vignettes, showed improved performance in identifying Fabry disease as a top diagnostic suggestion and received higher user satisfaction ratings compared to the original version. 
To fully realize the potential of this approach, it is crucial to include vignettes representing atypical presentations and to ","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e55001"},"PeriodicalIF":2.0,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12392689/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Raphaëlle Giguère, Victor Niaussat, Monia Noël-Hunter, William Witteman, Tanya S Paul, Alexandre Marois, Philippe Després, Simon Duchesne, Patrick M Archambault
Background: Delirium is prevalent in intensive care units (ICUs), often leading to adverse outcomes. Hypoactive delirium is particularly difficult to detect. Despite the development of new tools, timely identification of hypoactive delirium remains clinically challenging due to its dynamic nature, limited human resources, a lack of reliable monitoring tools, and subtle clinical signs, including hypovigilance. Machine learning models could support the identification of hypoactive delirium episodes by better detecting episodes of hypovigilance.
Objective: The aim of this study was to develop an artificial intelligence prediction model capable of detecting hypovigilance events using routinely collected physiological data in the ICU.
Methods: This derivation study was conducted using data from a prospective observational cohort of eligible patients admitted to the ICU in Lévis, Québec, Canada. We included patients admitted to the ICU between October 2021 and June 2022 who were aged ≥18 years and had an anticipated ICU stay of ≥48 hours. ICU nurses identified hypovigilant states every hour using the Richmond Agitation and Sedation Scale (RASS) or the Ramsay Sedation Scale (RSS). Routine vital signs (heart rate, respiratory rate, blood pressure, and oxygen saturation) and other physiological and clinical variables (premature ventricular contractions, intubation, use of sedative medication, and temperature) were automatically collected and stored using a CARESCAPE Gateway (General Electric) or collected manually through chart review (for sociodemographic characteristics and medication). Time series were generated around hypovigilance episodes for analysis. Random forest, XGBoost, and Light Gradient Boosting Machine classifiers were then used to detect hypovigilant episodes based on time series analysis. Hyperparameter optimization was performed using a random search in a 10-fold group-based cross-validation setup. To interpret the predictions of the best-performing models, we conducted a Shapley Additive Explanations (SHAP) analysis. We report the results of this study using the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis for machine learning models) guidelines, and potential biases were assessed using PROBAST (Prediction model Risk Of Bias ASsessment Tool).
Results: Of 136 potentially eligible participants, data from 30 patients (mean age 69 years, 63% male) were collected for analysis. Among all participants, 30% were admitted to the ICU for surgical reasons. Following data preprocessing, the study included 1493 hypovigilance episodes and 764 nonhypovigilant episodes. Among the 3 models evaluated, Light Gradient Boosting Machine demonstrated the best performance, achieving an average accuracy of 68% in detecting hypovigilant episodes, with a precision of 76%, a recall of 74%, an area under the curve (AUC) of 60%, and an F1-score …
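A sketch of the tuning setup named in the Methods follows: a gradient-boosted classifier (LightGBM, the best performer here) tuned by random search inside group-based 10-fold cross-validation, so episodes from one patient never straddle training and validation folds. The synthetic arrays mirror the reported episode and patient counts but are otherwise invented.

```python
# Group-aware random-search sketch on synthetic data; shapes mirror the
# reported counts (2257 episodes, 30 patients), features are invented.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(2257, 20))          # episode-level features (e.g., HR/RR stats)
y = rng.integers(0, 2, size=2257)        # 1 = hypovigilant episode
groups = rng.integers(0, 30, size=2257)  # patient ID per episode

search = RandomizedSearchCV(
    LGBMClassifier(),
    param_distributions={
        "num_leaves": [15, 31, 63],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [100, 300, 500],
    },
    n_iter=10,
    cv=GroupKFold(n_splits=10),  # folds split by patient, not by episode
    scoring="roc_auc",
)
search.fit(X, y, groups=groups)
print(search.best_params_, round(search.best_score_, 3))
```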
{"title":"Predicting Episodes of Hypovigilance in Intensive Care Units Using Routine Physiological Parameters and Artificial Intelligence: Derivation Study.","authors":"Raphaëlle Giguère, Victor Niaussat, Monia Noël-Hunter, William Witteman, Tanya S Paul, Alexandre Marois, Philippe Després, Simon Duchesne, Patrick M Archambault","doi":"10.2196/60885","DOIUrl":"10.2196/60885","url":null,"abstract":"<p><strong>Background: </strong>Delirium is prevalent in intensive care units (ICUs), often leading to adverse outcomes. Hypoactive delirium is particularly difficult to detect. Despite the development of new tools, the timely identification of hypoactive delirium remains clinically challenging due to its dynamic nature, lack of human resources, lack of reliable monitoring tools, and subtle clinical signs including hypovigilance. Machine learning models could support the identification of hypoactive delirium episodes by better detecting episodes of hypovigilance.</p><p><strong>Objective: </strong>Develop an artificial intelligence prediction model capable of detecting hypovigilance events using routinely collected physiological data in the ICU.</p><p><strong>Methods: </strong>This derivation study was conducted using data from a prospective observational cohort of eligible patients admitted to the ICU in Lévis, Québec, Canada. We included patients admitted to the ICU between October 2021 and June 2022 who were aged ≥18 years and had an anticipated ICU stay of ≥48 hours. ICU nurses identified hypovigilant states every hour using the Richmond Agitation and Sedation Scale (RASS) or the Ramsay Sedation Scale (RSS). Routine vital signs (heart rate, respiratory rate, blood pressure, and oxygen saturation), as well as other physiological and clinical variables (premature ventricular contractions, intubation, use of sedative medication, and temperature), were automatically collected and stored using a CARESCAPE Gateway (General Electric) or manually collected (for sociodemographic characteristics and medication) through chart review. Time series were generated around hypovigilance episodes for analysis. Random Forest, XGBoost, and Light Gradient Boosting Machine classifiers were then used to detect hypovigilant episodes based on time series analysis. Hyperparameter optimization was performed using a random search in a 10-fold group-based cross-validation setup. To interpret the predictions of the best-performing models, we conducted a Shapley Additive Explanations (SHAP) analysis. We report the results of this study using the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis for machine learning models) guidelines, and potential biases were assessed using PROBAST (Prediction model Risk Of Bias ASsessment Tool).</p><p><strong>Results: </strong>Out of 136 potentially eligible participants, data from 30 patients (mean age 69 y, 63% male) were collected for analysis. Among all participants, 30% were admitted to the ICU for surgical reasons. Following data preprocessing, the study included 1493 hypovigilance episodes and 764 nonhypovigilant episodes. Among the 3 models evaluated, Light Gradient Boosting Machine demonstrated the best performance. 
It achieved an average accuracy of 68% to detect hypovigilant episodes, with a precision of 76%, a recall of 74%, an area under the curve (AUC) of 60%, and an F","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e60885"},"PeriodicalIF":2.0,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384691/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ethan Bernstein, Anya Ramsamooj, Kelsey L Millar, Zachary C Lum
Background: Since the release of ChatGPT and other large language models (LLMs), there has been a significant increase in academic publications exploring their capabilities and implications across various fields, such as medicine, education, and technology.
Objective: This study aims to identify the most influential academic works on LLMs published in the past year and to categorize their research types and thematic focuses across different professional fields. The study also evaluates the ability of artificial intelligence (AI) tools, such as ChatGPT, to accurately classify academic research.
Methods: We conducted a bibliometric analysis using Clarivate's Web of Science (WOS) to extract the top 100 most cited papers on LLMs. Papers were manually categorized by field, journal, author, and research type. ChatGPT-4 was used to generate categorizations for the same papers, and its performance was compared to human classifications. We summarized the distribution of research fields and assessed the concordance between AI-generated and manual classifications.
Results: Medicine emerged as the predominant field among the top 100 most cited papers, accounting for 43 (43%), followed by education with 26 (26%) and technology with 15 (15%). Medical literature primarily focused on clinical applications of LLMs, limitations of AI in health care, and the role of AI in medical education. In education, research centered on ethical concerns and potential applications of AI for teaching and learning. ChatGPT demonstrated variable concordance with human reviewers, achieving an agreement rate of 47% for research types and 92% for fields of study.
Conclusions: While LLMs such as ChatGPT exhibit considerable potential in aiding research categorization, human oversight remains essential to address issues such as hallucinations, outdated information, and biases in AI-generated outputs. This study highlights the transformative potential of LLMs across multiple sectors and emphasizes the importance of continuous ethical evaluation and iterative improvement of AI systems to maximize their benefits while minimizing risks.
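The agreement figures above can be reproduced from paired label lists; the sketch below computes simple percent agreement and, as an optional chance-corrected companion not reported by the study, Cohen kappa. The labels are invented for illustration.

```python
# Human-vs-model concordance sketch on invented labels.
from sklearn.metrics import cohen_kappa_score

human = ["medicine", "education", "technology", "medicine", "medicine"]
model = ["medicine", "education", "medicine", "medicine", "medicine"]

# Raw percent agreement, as reported in the study.
agreement = sum(h == m for h, m in zip(human, model)) / len(human)
print(f"percent agreement: {agreement:.0%}")

# Chance-corrected agreement (our addition, not in the abstract).
print(f"Cohen kappa: {cohen_kappa_score(human, model):.2f}")
```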
{"title":"Identification and Categorization of the Top 100 Articles and the Future of Large Language Models: Thematic Analysis Using Bibliometric Analysis.","authors":"Ethan Bernstein, Anya Ramsamooj, Kelsey L Millar, Zachary C Lum","doi":"10.2196/68603","DOIUrl":"10.2196/68603","url":null,"abstract":"<p><strong>Background: </strong>Since the release of ChatGPT and other large language models (LLMs), there has been a significant increase in academic publications exploring their capabilities and implications across various fields, such as medicine, education, and technology.</p><p><strong>Objective: </strong>This study aims to identify the most influential academic works on LLMs published in the past year, categorize their research types and thematic focuses, within different professional fields. The study also evaluates the ability of artificial intelligence (AI) tools, such as ChatGPT, to accurately classify academic research.</p><p><strong>Methods: </strong>We conducted a bibliometric analysis using Clarivate's Web of Science (WOS) to extract the top 100 most cited papers on LLMs. Papers were manually categorized by field, journal, author, and research type. ChatGPT-4 was used to generate categorizations for the same papers, and its performance was compared to human classifications. We summarized the distribution of research fields and assessed the concordance between AI-generated and manual classifications.</p><p><strong>Results: </strong>Medicine emerged as the predominant field among the top 100 most cited papers, accounting for 43 (43%), followed by education 26 (26%) and technology 15 (15%). Medical literature primarily focused on clinical applications of LLMs, limitations of AI in health care, and the role of AI in medical education. In education, research was centered around ethical concerns and potential applications of AI for teaching and learning. ChatGPT demonstrated variable concordance with human reviewers, achieving an agreement rating of 47% for research types and 92% for fields of study.</p><p><strong>Conclusions: </strong>While LLMs such as ChatGPT exhibit considerable potential in aiding research categorization, human oversight remains essential to address issues such as hallucinations, outdated information, and biases in AI-generated outputs. This study highlights the transformative potential of LLMs across multiple sectors and emphasizes the importance of continuous ethical evaluation and iterative improvement of AI systems to maximize their benefits while minimizing risks.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68603"},"PeriodicalIF":2.0,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384689/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Masab Mansoor, Andrew Ibrahim, Ali Hamide
Background: Limited research exists evaluating artificial intelligence (AI) performance on standardized pediatric assessments. This study evaluated 3 leading AI models on pediatric board preparation questions.
Objective: The aim of this study is to evaluate and compare the performance of 3 leading large language models (LLMs) on pediatric board examination preparation questions and contextualize their performance against human physician benchmarks.
Methods: We analyzed DeepSeek-R1, ChatGPT-4, and ChatGPT-4.5 using 266 multiple-choice questions from the 2023 PREP Self-Assessment. Performance was compared to published American Board of Pediatrics first-time pass rates.
Results: DeepSeek-R1 exhibited the highest accuracy at 98.1% (261/266 correct responses). ChatGPT-4.5 achieved 96.6% accuracy (257/266), performing at the upper threshold of human performance. ChatGPT-4 demonstrated 82.7% accuracy (220/266), comparable to the lower range of human pass rates. Error pattern analysis revealed that AI models most commonly struggled with questions requiring integration of complex clinical presentations with rare disease knowledge.
Conclusions: DeepSeek-R1 demonstrated exceptional performance exceeding typical American Board of Pediatrics pass rates, suggesting potential applications in medical education and clinical support, though further research on complex clinical reasoning is needed.
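The accuracies above come with straightforward binomial uncertainty; as a hedged aside (the abstract itself reports no CIs), the sketch below computes Wilson 95% CIs from the stated correct-answer counts.

```python
# Proportions and Wilson 95% CIs from the reported counts (out of 266).
from statsmodels.stats.proportion import proportion_confint

models = {"DeepSeek-R1": 261, "ChatGPT-4.5": 257, "ChatGPT-4": 220}
n = 266
for name, correct in models.items():
    lo, hi = proportion_confint(correct, n, alpha=0.05, method="wilson")
    print(f"{name}: {correct / n:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```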
{"title":"Performance of DeepSeek and GPT Models on Pediatric Board Preparation Questions: Comparative Evaluation.","authors":"Masab Mansoor, Andrew Ibrahim, Ali Hamide","doi":"10.2196/76056","DOIUrl":"10.2196/76056","url":null,"abstract":"<p><strong>Background: </strong>Limited research exists evaluating artificial intelligence (AI) performance on standardized pediatric assessments. This study evaluated 3 leading AI models on pediatric board preparation questions.</p><p><strong>Objective: </strong>The aim of this study is to evaluate and compare the performance of 3 leading large language models (LLMs) on pediatric board examination preparation questions and contextualize their performance against human physician benchmarks.</p><p><strong>Methods: </strong>We analyzed DeepSeek-R1, ChatGPT-4, and ChatGPT-4.5 using 266 multiple-choice questions from the 2023 PREP Self-Assessment. Performance was compared to published American Board of Pediatrics first-time pass rates.</p><p><strong>Results: </strong>DeepSeek-R1 exhibited the highest accuracy at 98.1% (261/266 correct responses). ChatGPT-4.5 achieved 96.6% accuracy (257/266), performing at the upper threshold of human performance. ChatGPT-4 demonstrated 82.7% accuracy (220/266), comparable to the lower range of human pass rates. Error pattern analysis revealed that AI models most commonly struggled with questions requiring integration of complex clinical presentations with rare disease knowledge.</p><p><strong>Conclusions: </strong>DeepSeek-R1 demonstrated exceptional performance exceeding typical American Board of Pediatrics pass rates, suggesting potential applications in medical education and clinical support, though further research on complex clinical reasoning is needed.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e76056"},"PeriodicalIF":2.0,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384676/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shaodong Wang, Yiqun Jiang, Qing Li, Wenli Zhang
Background: Intensive care units (ICUs) treat patients with life-threatening illnesses. Worldwide, intensive care demand is massive. Predicting patient outcomes in ICUs holds significant importance for health care operation management. Nevertheless, it remains a challenging problem that researchers and health care practitioners have yet to overcome. While the newly emerging health digital trace data offer new possibilities, such data contain complex time series and patterns. Although researchers have devised severity score systems, traditional machine learning models with feature engineering, and deep learning models that use raw clinical data to predict ICU outcomes, existing methods have limitations.
Objective: This study aimed to develop a novel feature extraction and machine learning framework to repurpose and extract features with strong predictive power from patients' health digital traces for ICU outcome prediction.
Methods: Guided by signal processing techniques and medical domain knowledge, the proposed framework introduces a novel signal processing-based feature engineering method that extracts highly predictive features from ICU digital trace data. The method was rigorously evaluated on a real-world ICU database to assess prediction accuracy and feature representativeness against both traditional and deep learning baseline methods.
Results: The prediction results obtained by the proposed framework significantly outperformed state-of-the-art benchmarks. This demonstrated the framework's effectiveness in capturing key patterns from complex health digital traces for improving ICU outcome prediction.
Conclusions: Our study contributes to health care operation management by leveraging digital traces from health care information systems to address challenges with significant implications for health care.
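The title names the two ingredients: signal processing-based feature extraction and ν-support vector classification. The sketch below pairs Welch power-spectral-density summaries (our stand-in for the paper's stochastic signal processing features) with scikit-learn's NuSVC on synthetic vital-sign traces; all shapes and features are illustrative.

```python
# Two-stage sketch: spectral features from a vital-sign trace, then nu-SVC.
import numpy as np
from scipy.signal import welch
from sklearn.svm import NuSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
traces = rng.normal(70, 10, size=(200, 256))  # 200 stays x 256 HR samples
labels = rng.integers(0, 2, size=200)         # ICU outcome, synthetic

def spectral_features(x):
    # Welch power spectral density as a simple signal-processing summary.
    freqs, power = welch(x, fs=1.0, nperseg=64)
    return [power.sum(), freqs[np.argmax(power)], power.max() / power.mean()]

X = np.array([spectral_features(t) for t in traces])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0)
clf = NuSVC(nu=0.5).fit(X_tr, y_tr)
print(f"holdout accuracy: {clf.score(X_te, y_te):.2f}")
```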
{"title":"Intensive Care Unit Patient Outcome Prediction Using ν-Support Vector Classification and Stochastic Signal Processing-Based Feature Extraction Techniques: Algorithm Development and Validation Study.","authors":"Shaodong Wang, Yiqun Jiang, Qing Li, Wenli Zhang","doi":"10.2196/72671","DOIUrl":"10.2196/72671","url":null,"abstract":"<p><strong>Background: </strong>Intensive care units (ICUs) treat patients with life-threatening illnesses. Worldwide, intensive care demand is massive. Predicting patient outcomes in ICUs holds significant importance for health care operation management. Nevertheless, it remains a challenging problem that researchers and health care practitioners have yet to overcome. While the newly emerging health digital trace data offer new possibilities, such data contain complex time series and patterns. Although researchers have devised severity score systems, traditional machine learning models with feature engineering, and deep learning models that use raw clinical data to predict ICU outcomes, existing methods have limitations.</p><p><strong>Objective: </strong>This study aimed to develop a novel feature extraction and machine learning framework to repurpose and extract features with strong predictive power from patients' health digital traces for ICU outcome prediction.</p><p><strong>Methods: </strong>Guided by signal processing techniques and medical domain knowledge, the proposed framework introduces a novel, signal processing-based feature engineering method to extract highly predictive features from ICU digital trace data. We rigorously evaluated this method on a real-world ICU dataset, demonstrating significant improvements over both traditional and deep learning baseline methods. The method was then evaluated using a real-world database to assess prediction accuracy and feature representativeness.</p><p><strong>Results: </strong>The prediction results obtained by the proposed framework significantly outperformed state-of-the-art benchmarks. This demonstrated the framework's effectiveness in capturing key patterns from complex health digital traces for improving ICU outcome prediction.</p><p><strong>Conclusions: </strong>Our study contributes to health care operation management by leveraging digital traces from health care information systems to address challenges with significant implications for health care.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e72671"},"PeriodicalIF":2.0,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12421204/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phuong Dinh Ngo, Miguel Ángel Tejedor Hernández, Taridzo Chomutare, Andrius Budrionis, Therese Olsen Svenning, Torbjørn Torsvik, Anastasios Lamproudis, Hercules Dalianis
Background: Accurately assigning ICD-10 (International Statistical Classification of Diseases, Tenth Revision) codes is critical for clinical documentation, reimbursement processes, epidemiological studies, and health care planning. Manual coding is time-consuming, labor-intensive, and prone to errors, underscoring the need for automated solutions within the Norwegian health care system. Recent advances in natural language processing (NLP) and transformer-based language models have shown promising results in automating ICD (International Classification of Diseases) coding in several languages. However, prior work has focused primarily on English and other high-resource languages, leaving a gap in Norwegian-specific clinical NLP research.
Objective: This study introduces 2 versions of NorDeClin-BERT (NorDeClin Bidirectional Encoder Representations from Transformers), domain-specific Norwegian BERT-based models pretrained on a large corpus of Norwegian clinical text to enhance their understanding of medical language. Both models were subsequently fine-tuned to predict ICD-10 diagnosis codes. We aimed to evaluate the impact of domain-specific pretraining and model size on classification performance and to compare NorDeClin-BERT with general-purpose and cross-lingual BERT models in the context of Norwegian ICD-10 coding.
Methods: Two versions of NorDeClin-BERT were pretrained on the ClinCode Gastro Corpus, a large-scale dataset comprising 8.8 million deidentified Norwegian clinical notes, to enhance domain-specific language modeling. The base model builds upon NorBERT3-base and was pretrained on a large, relevant subset of the corpus, while the large model builds upon NorBERT3-large and was trained on the full dataset. Both models were benchmarked against SweDeClin-BERT, ScandiBERT, NorBERT3-base, and NorBERT3-large, using standard evaluation metrics: accuracy, precision, recall, and F1-score.
Results: Both versions of NorDeClin-BERT outperformed general-purpose Norwegian BERT models and Swedish clinical BERT models in classifying both prevalent and less common ICD-10 codes. Notably, NorDeClin-BERT-large achieved the highest overall performance across evaluation metrics, demonstrating the impact of domain-specific clinical pretraining in Norwegian. These results highlight that domain-specific pretraining on Norwegian clinical text, combined with model capacity, improves ICD-10 classification accuracy compared with general-domain Norwegian models and Swedish models pretrained on clinical text. Furthermore, while Swedish clinical models demonstrated some transferability to Norwegian, their performance remained suboptimal, emphasizing the necessity of Norwegian-specific clinical pretraining.
Conclusions: This study highlights the potential of NorDeClin-BERT to improve ICD-10 code classification for the gastroenterology domain …
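As a hedged illustration of the fine-tuning stage, the sketch below attaches a sequence-classification head over ICD-10 codes to a public Norwegian BERT. NbAiLab/nb-bert-base is a stand-in encoder (NorDeClin-BERT's clinically pretrained weights are not assumed available here), and the label count and example sentence are invented.

```python
# Fine-tuning sketch with a public stand-in Norwegian encoder; the paper
# instead continues pretraining NorBERT3-base/-large on clinical text first.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "NbAiLab/nb-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=50  # one logit per candidate ICD-10 code (invented count)
)

batch = tokenizer(
    ["Pasienten har magesmerter og kvalme."],  # "abdominal pain and nausea"
    return_tensors="pt", truncation=True, padding=True,
)
labels = torch.tensor([3])           # index of the gold ICD-10 code
out = model(**batch, labels=labels)
out.loss.backward()                  # one fine-tuning step (optimizer omitted)
print(out.logits.shape)              # torch.Size([1, 50])
```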
{"title":"Domain-Specific Pretraining of NorDeClin-Bidirectional Encoder Representations From Transformers for <i>International Statistical Classification of Diseases, Tenth Revision,</i> Code Prediction in Norwegian Clinical Texts: Model Development and Evaluation Study.","authors":"Phuong Dinh Ngo, Miguel Ángel Tejedor Hernández, Taridzo Chomutare, Andrius Budrionis, Therese Olsen Svenning, Torbjørn Torsvik, Anastasios Lamproudis, Hercules Dalianis","doi":"10.2196/66153","DOIUrl":"10.2196/66153","url":null,"abstract":"<p><strong>Background: </strong>Accurately assigning ICD-10 (International Statistical Classification of Diseases, Tenth Revision) codes is critical for clinical documentation, reimbursement processes, epidemiological studies, and health care planning. Manual coding is time-consuming, labor-intensive, and prone to errors, underscoring the need for automated solutions within the Norwegian health care system. Recent advances in natural language processing (NLP) and transformer-based language models have shown promising results in automating ICD (International Classification of Diseases) coding in several languages. However, prior work has focused primarily on English and other high-resource languages, leaving a gap in Norwegian-specific clinical NLP research.</p><p><strong>Objective: </strong>This study introduces 2 versions of NorDeClin-BERT (NorDeClin Bidirectional Encoder Representations from Transformers), domain-specific Norwegian BERT-based models pretrained on a large corpus of Norwegian clinical text to enhance their understanding of medical language. Both models were subsequently fine-tuned to predict ICD-10 diagnosis codes. We aimed to evaluate the impact of domain-specific pretraining and model size on classification performance and to compare NorDeClin-BERT with general-purpose and cross-lingual BERT models in the context of Norwegian ICD-10 coding.</p><p><strong>Methods: </strong>Two versions of NorDeClin-BERT were pretrained on the ClinCode Gastro Corpus, a large-scale dataset comprising 8.8 million deidentified Norwegian clinical notes, to enhance domain-specific language modeling. The base model builds upon NorBERT3-base and was pretrained on a large, relevant subset of the corpus, while the large model builds upon NorBERT3-large and was trained on the full dataset. Both models were benchmarked against SweDeClin-BERT, ScandiBERT, NorBERT3-base, and NorBERT3-large, using standard evaluation metrics: accuracy, precision, recall, and F1-score.</p><p><strong>Results: </strong>The results show that both versions of NorDeClin-BERT outperformed general-purpose Norwegian BERT models and Swedish clinical BERT models in classifying both prevalent and less common ICD-10 codes. Notably, NorDeClin-BERT-large achieved the highest overall performance across evaluation metrics, demonstrating the impact of domain-specific clinical pretraining in Norwegian. These results highlight that domain-specific pretraining on Norwegian clinical text, combined with model capacity, improves ICD-10 classification accuracy compared with general-domain Norwegian models and Swedish models pretrained on clinical text. 
Furthermore, while Swedish clinical models demonstrated some transferability to Norwegian, their performance remained suboptimal, emphasizing the necessity of Norwegian-specific clinical pretraining.</p><p><strong>Conclusions: </strong>This study highlights the potential of NorDeClin-BERT to improve ICD-10 code classification for the gastroenterology do","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e66153"},"PeriodicalIF":2.0,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12377785/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haya Engelstein, Roni Ramon-Gonen, Avi Sabbag, Eyal Klang, Karin Sudri, Michal Cohen-Shelly, Israel Barbash
Background: Recent progress has demonstrated the potential of deep learning models in analyzing electrocardiogram (ECG) pathologies. However, such models are intricate, expensive to develop, and designed for specific purposes. Large language models show promise in medical image interpretation, yet their effectiveness in ECG analysis remains understudied. Generative Pretrained Transformer 4 Omni (GPT-4o), a multimodal artificial intelligence model capable of processing images and text without task-specific training, may offer an accessible alternative.
Objective: This study aimed to evaluate GPT-4o's effectiveness in interpreting 12-lead ECGs, assessing classification accuracy, and exploring methods to enhance its performance.
Methods: A total of 6 common ECG diagnoses were evaluated: normal ECG, ST-segment elevation myocardial infarction, atrial fibrillation, right bundle branch block, left bundle branch block, and paced rhythm, with 30 normal ECGs and 10 of each abnormal pattern, totaling 80 cases. Deidentified ECGs were analyzed using OpenAI's GPT-4o. Our study used both zero-shot and few-shot learning methodologies to investigate three main scenarios: (1) ECG image recognition, (2) binary classification of normal versus abnormal ECGs, and (3) multiclass classification into 6 categories.
Results: The model excelled in recognizing ECG images, achieving an accuracy of 100%. In the classification of normal or abnormal ECG cases, the few-shot learning approach improved GPT-4o's accuracy by 30% from the baseline, reaching 83% (95% CI 81.8%-84.6%). However, multiclass classification for a specific pathology remained limited, achieving only 41% accuracy.
Conclusions: GPT-4o effectively differentiates normal from abnormal ECGs, suggesting its potential as an accessible artificial intelligence-assisted triage tool. Although limited in diagnosing specific cardiac conditions, GPT-4o's capability to interpret ECG images without specialized training highlights its potential for preliminary ECG interpretation in clinical and remote settings.
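A minimal sketch of the kind of zero-shot call evaluated above, using the OpenAI Python SDK's image input; the prompt wording and file name are illustrative, and a few-shot variant would prepend labeled example images to the message list.

```python
# Zero-shot ECG-image classification sketch via the OpenAI SDK.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("ecg_12lead.png", "rb") as f:  # illustrative file name
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Is this 12-lead ECG normal or abnormal? Answer with one word."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```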
{"title":"Effectiveness of the GPT-4o Model in Interpreting Electrocardiogram Images for Cardiac Diagnostics: Diagnostic Accuracy Study.","authors":"Haya Engelstein, Roni Ramon-Gonen, Avi Sabbag, Eyal Klang, Karin Sudri, Michal Cohen-Shelly, Israel Barbash","doi":"10.2196/74426","DOIUrl":"10.2196/74426","url":null,"abstract":"<p><strong>Background: </strong>Recent progress has demonstrated the potential of deep learning models in analyzing electrocardiogram (ECG) pathologies. However, this method is intricate, expensive to develop, and designed for specific purposes. Large language models show promise in medical image interpretation, and yet their effectiveness in ECG analysis remains understudied. Generative Pretrained Transformer 4 Omni (GPT-4o), a multimodal artificial intelligence model, capable of processing images and text without task-specific training, may offer an accessible alternative.</p><p><strong>Objective: </strong>This study aimed to evaluate GPT-4o's effectiveness in interpreting 12-lead ECGs, assessing classification accuracy, and exploring methods to enhance its performance.</p><p><strong>Methods: </strong>A total of 6 common ECG diagnoses were evaluated: normal ECG, ST-segment elevation myocardial infarction, atrial fibrillation, right bundle branch block, left bundle branch block, and paced rhythm, with 30 normal ECGs and 10 of each abnormal pattern, totaling 80 cases. Deidentified ECGs were analyzed using OpenAI's GPT-4o. Our study used both zero-shot and few-shot learning methodologies to investigate three main scenarios: (1) ECG image recognition, (2) binary classification of normal versus abnormal ECGs, and (3) multiclass classification into 6 categories.</p><p><strong>Results: </strong>The model excelled in recognizing ECG images, achieving an accuracy of 100%. In the classification of normal or abnormal ECG cases, the few-shot learning approach improved GPT-4o's accuracy by 30% from the baseline, reaching 83% (95% CI 81.8%-84.6%). However, multiclass classification for a specific pathology remained limited, achieving only 41% accuracy.</p><p><strong>Conclusions: </strong>GPT-4o effectively differentiates normal from abnormal ECGs, suggesting its potential as an accessible artificial intelligence-assisted triage tool. Although limited in diagnosing specific cardiac conditions, GPT-4o's capability to interpret ECG images without specialized training highlights its potential for preliminary ECG interpretation in clinical and remote settings.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e74426"},"PeriodicalIF":2.0,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12375907/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}