Veysel Kocaman, Fu-Yuan Cheng, Julio Bonis, Ganesh Raut, Prem Timsina, David Talby, Arash Kia
Background: Clinical notes house rich but unstructured patient data, and the ambiguity introduced by medical jargon, abbreviations, and synonyms makes them challenging to analyze. This complicates real-time extraction for decision support tools.
Objective: This study aimed to examine the data curation, technology, and workflow of the named entity recognition (NER) pipeline, a component of a broader clinical decision support tool that identifies key entities using NER models and classifies these entities as present or absent in the patient through an NER assertion model.
Methods: We gathered progress care, radiology, and pathology notes from 5000 patients, dividing them into 5 batches of 1000 patients each. Metrics such as notes and reports per patient, sentence count, token size, runtime, and central processing unit (CPU) and memory use were measured per note type. We also evaluated the precision of the NER outputs and then the precision and recall of the NER assertion models against manual annotations by a clinical expert.
Results: Using Spark natural language processing (NLP) clinical pretrained NER models on 138,250 clinical notes, we observed excellent NER precision, peaking at 0.989 (95% CI 0.977-1.000) for procedures, and an assertion-model accuracy of 0.889 (95% CI 0.856-0.922). Our analysis highlighted long-tail distributions in notes per patient, note length, and entity density. Progress care notes had notably more entities per sentence than radiology and pathology notes, showing 4-fold and 16-fold differences, respectively.
Conclusions: Further research should explore the analysis of clinical notes beyond the scope of our study, including discharge summaries and psychiatric evaluation notes. Recognizing the unique linguistic characteristics of different note types underscores the importance of developing specialized NER models or natural language processing pipeline setups tailored to each type. By doing so, we can enhance their performance across a more diverse range of clinical scenarios.
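To make the pipeline structure concrete, the sketch below assembles a Spark NLP document-to-entities pipeline with open-source pretrained models. The "glove_100d" and "ner_dl" models here are public stand-ins; the study used licensed clinical NER and assertion models from Spark NLP for Healthcare, which follow the same stage pattern with an assertion stage appended after the entity converter.

```python
# Minimal sketch, assuming the open-source Spark NLP distribution
# (pip install spark-nlp pyspark). The study's licensed clinical NER and
# assertion models plug into the same stage sequence.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
)
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = (WordEmbeddingsModel.pretrained("glove_100d")
              .setInputCols(["sentence", "token"]).setOutputCol("embeddings"))
ner = (NerDLModel.pretrained("ner_dl")
       .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner"))
chunks = (NerConverter()
          .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk"))

pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, chunks])
notes = spark.createDataFrame(
    [["Patient denies chest pain; an EKG was performed."]], ["text"]
)
result = pipeline.fit(notes).transform(notes)
result.selectExpr("explode(ner_chunk.result) AS entity").show(truncate=False)
```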
{"title":"Exploring Named Entity Recognition Potential and the Value of Tailored Natural Language Processing Pipelines for Radiology, Pathology, and Progress Notes in Clinical Decision Support: Quantitative Study.","authors":"Veysel Kocaman, Fu-Yuan Cheng, Julio Bonis, Ganesh Raut, Prem Timsina, David Talby, Arash Kia","doi":"10.2196/59251","DOIUrl":"10.2196/59251","url":null,"abstract":"<p><strong>Background: </strong>Clinical notes house rich, yet unstructured, patient data, making analysis challenging due to medical jargon, abbreviations, and synonyms causing ambiguity. This complicates real-time extraction for decision support tools.</p><p><strong>Objective: </strong>This study aimed to examine the data curation, technology, and workflow of the named entity recognition (NER) pipeline, a component of a broader clinical decision support tool that identifies key entities using NER models and classifies these entities as present or absent in the patient through an NER assertion model.</p><p><strong>Methods: </strong>We gathered progress care, radiology, and pathology notes from 5000 patients, dividing them into 5 batches of 1000 patients each. Metrics such as notes and reports per patient, sentence count, token size, runtime, central processing unit, and memory use were measured per note type. We also evaluated the precision of the NER outputs and then the precision and recall of NER assertion models against manual annotations by a clinical expert.</p><p><strong>Results: </strong>Using Spark natural language processing clinical pretrained NER models on 138,250 clinical notes, we observed excellent NER precision, with a peak in procedures at 0.989 (95% CI 0.977-1.000) and an accuracy in the assertion model of 0.889 (95% CI 0.856-0.922). Our analysis highlighted long-tail distributions in notes per patient, note length, and entity density. Progress care notes had notably more entities per sentence than radiology and pathology notes, showing 4-fold and 16-fold differences, respectively.</p><p><strong>Conclusions: </strong>Further research should explore the analysis of clinical notes beyond the scope of our study, including discharge summaries and psychiatric evaluation notes. Recognizing the unique linguistic characteristics of different note types underscores the importance of developing specialized NER models or natural language processing pipeline setups tailored to each type. By doing so, we can enhance their performance across a more diverse range of clinical scenarios.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e59251"},"PeriodicalIF":2.0,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449662/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145006989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laura Antonia Meliante, Giulia Coco, Alessandro Rabiolo, Stefano De Cillà, Gianluca Manni
Background: Artificial intelligence (AI) is becoming increasingly popular in the scientific field, as it can analyze extensive datasets, summarize results, and assist in writing academic papers.
Objective: This study investigates the role of AI in the process of conducting a systematic literature review (SLR), focusing on its contributions and limitations at three key stages of SLR development (study selection, data extraction, and study composition), using glaucoma-related SLRs as case studies and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-based SLRs as benchmarks.
Methods: Four AI platforms were tested on their ability to reproduce four PRISMA-based, glaucoma-related SLRs. We used Connected Papers and Elicit to search for relevant records; we then assessed the ability of Elicit and ChatPDF to extract and organize information contained in the retrieved records. Finally, we tested Jenni AI's capacity to compose an SLR.
Results: Neither Connected Papers nor Elicit provided the totality of the results found using the PRISMA method. On average, data extracted from Elicit were accurate in 51.40% (SD 31.45%) of cases and imprecise in 13.69% (SD 17.98%); 22.37% (SD 27.54%) of responses were missing, while 12.51% (SD 14.70%) were incorrect. Data extracted from ChatPDF were accurate in 60.33% (SD 30.72%) of cases and imprecise in 7.41% (SD 13.88%); 17.56% (SD 20.02%) of responses were missing, and 14.70% (SD 17.72%) were incorrect. Jenni AI's generated content exhibited satisfactory language fluency and technical proficiency but was insufficient in defining methods, elaborating results, and stating conclusions.
Conclusions: The PRISMA method continues to exhibit clear superiority in terms of reproducibility and accuracy during the literature search, data extraction, and study composition phases of the SLR writing process. While AI can save time and assist with repetitive tasks, the active participation of the researcher throughout the entire process is still crucial to maintain control over the quality, accuracy, and objectivity of their work.
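For readers reproducing this kind of content analysis, the sketch below shows one way per-review outcome counts could be rolled up into the mean (SD) percentages reported above; the counts themselves are invented placeholders, not the study's data.

```python
# Minimal sketch with hypothetical per-review extraction outcome counts.
import statistics

# One dict per benchmark SLR: tallies of extraction outcomes for one tool.
reviews = [
    {"accurate": 42, "imprecise": 6, "missing": 20, "incorrect": 12},
    {"accurate": 55, "imprecise": 10, "missing": 8, "incorrect": 7},
    {"accurate": 30, "imprecise": 12, "missing": 25, "incorrect": 13},
    {"accurate": 70, "imprecise": 4, "missing": 16, "incorrect": 10},
]

for outcome in ("accurate", "imprecise", "missing", "incorrect"):
    # Convert each review's count to a percentage, then average across reviews.
    pcts = [100 * r[outcome] / sum(r.values()) for r in reviews]
    print(f"{outcome}: mean {statistics.mean(pcts):.2f}% "
          f"(SD {statistics.stdev(pcts):.2f}%)")
```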
{"title":"Evaluation of AI Tools Versus the PRISMA Method for Literature Search, Data Extraction, and Study Composition in Glaucoma Systematic Reviews: Content Analysis.","authors":"Laura Antonia Meliante, Giulia Coco, Alessandro Rabiolo, Stefano De Cillà, Gianluca Manni","doi":"10.2196/68592","DOIUrl":"10.2196/68592","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) is becoming increasingly popular in the scientific field, as it allows for the analysis of extensive datasets, summarizes results, and assists in writing academic papers.</p><p><strong>Objective: </strong>This study investigates the role of AI in the process of conducting a systematic literature review (SLR), focusing on its contributions and limitations at three key stages of its development, study selection, data extraction, and study composition, using glaucoma-related SLRs as case studies and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-based SLRs as benchmarks.</p><p><strong>Methods: </strong>Four AI platforms were tested on their ability to reproduce four PRISMA-based, glaucoma-related SLRs. We used Connected Papers and Elicit to perform research of relevant records; then we assessed Elicit and ChatPDF's ability to extract and organize information contained in the retrieved records. Finally, we tested Jenni AI's capacity to compose an SLR.</p><p><strong>Results: </strong>Neither Connected Papers nor Elicit provided the totality of the results found using the PRISMA method. On average, data extracted from Elicit were accurate in 51.40% (SD 31.45%) of cases and imprecise in 13.69% (SD 17.98%); 22.37% (SD 27.54%) of responses were missing, while 12.51% (SD 14.70%) were incorrect. Data extracted from ChatPDF were accurate in 60.33% (SD 30.72%) of cases and imprecise in 7.41% (SD 13.88%); 17.56% (SD 20.02%) of responses were missing, and 14.70% (SD 17.72%) were incorrect. Jenni AI's generated content exhibited satisfactory language fluency and technical proficiency but was insufficient in defining methods, elaborating results, and stating conclusions.</p><p><strong>Conclusions: </strong>The PRISMA method continues to exhibit clear superiority in terms of reproducibility and accuracy during the literature search, data extraction, and study composition phases of the SLR writing process. While AI can save time and assist with repetitive tasks, the active participation of the researcher throughout the entire process is still crucial to maintain control over the quality, accuracy, and objectivity of their work.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68592"},"PeriodicalIF":2.0,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12413140/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145006987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingjia Huo, Sean Perez, Linda Awdishu, Janice S Kerr, Pengtao Xie, Adnan Khan, Kristin Mekeel, Shamim Nemati
Background: Tacrolimus forms the backbone of immunosuppressive therapy in solid organ transplantation, requiring precise dosing due to its narrow therapeutic range. Maintaining therapeutic tacrolimus levels in the postoperative period is challenging due to diverse patient characteristics, donor organ factors, drug interactions, and evolving perioperative physiology.
Objective: The aim of this study is to design a machine learning model to predict the next-day tacrolimus trough concentrations (C0) and guide dosing to prevent persistent under- or overdosing.
Methods: We used retrospective data from 1597 adult recipients of kidney and liver transplants at UC San Diego Health to develop a long short-term memory (LSTM) model to predict next-day tacrolimus C0 in an inpatient setting. Predictors included transplant type, demographics, comorbidities, vital signs, laboratory parameters, ordered diet, and medications. Permutation feature importance was evaluated for the model. We further implemented a classification task to evaluate the model's ability to identify underdosing, therapeutic dosing, and overdosing. Finally, we generated next-day dose recommendations that would achieve tacrolimus C0 within the target ranges.
Results: The LSTM model provided a mean absolute error of 1.880 ng/mL when predicting next-day tacrolimus C0. Top predictive features included the recent tacrolimus C0, tacrolimus doses, transplant organ type, diet, and interactive drugs. When predicting underdosing, therapeutic dosing, and overdosing using a 3-class classification task, the model achieved a microaverage F1-score of 0.653. For dose recommendations, the best clinical outcomes were achieved when the actual total daily dose closely aligned with the model's recommended dose (within 3 mg).
Conclusions: Ours is one of the largest studies to apply artificial intelligence to tacrolimus dosing, and our LSTM model effectively predicts tacrolimus C0 and could potentially guide accurate dose recommendations. Further prospective studies are needed to evaluate the model's performance in real-world dose adjustments.
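As a rough illustration of the modeling setup, here is a minimal PyTorch sketch of an LSTM regressor that maps a patient's daily feature sequence to a next-day trough prediction and is trained against mean absolute error, the metric reported above. Feature count, sequence length, and layer sizes are illustrative assumptions, not the authors' architecture.

```python
# Minimal LSTM-regressor sketch; all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class TacrolimusLSTM(nn.Module):
    def __init__(self, n_features: int = 32, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # predicted next-day C0 (ng/mL)

    def forward(self, x):  # x: (batch, days, n_features)
        out, _ = self.lstm(x)
        # Use the hidden state of the most recent inpatient day.
        return self.head(out[:, -1, :]).squeeze(-1)

model = TacrolimusLSTM()
x = torch.randn(8, 5, 32)        # 8 patients, 5 inpatient days, 32 features
y = torch.rand(8) * 15           # observed troughs (ng/mL), synthetic
loss = nn.L1Loss()(model(x), y)  # MAE, matching the reported error metric
loss.backward()
print(f"batch MAE: {loss.item():.3f} ng/mL")
```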
{"title":"AI-Driven Tacrolimus Dosing in Transplant Care: Cohort Study.","authors":"Mingjia Huo, Sean Perez, Linda Awdishu, Janice S Kerr, Pengtao Xie, Adnan Khan, Kristin Mekeel, Shamim Nemati","doi":"10.2196/67302","DOIUrl":"10.2196/67302","url":null,"abstract":"<p><strong>Background: </strong>Tacrolimus forms the backbone of immunosuppressive therapy in solid organ transplantation, requiring precise dosing due to its narrow therapeutic range. Maintaining therapeutic tacrolimus levels in the postoperative period is challenging due to diverse patient characteristics, donor organ factors, drug interactions, and evolving perioperative physiology.</p><p><strong>Objective: </strong>The aim of this study is to design a machine learning model to predict the next-day tacrolimus trough concentrations (C0) and guide dosing to prevent persistent under- or overdosing.</p><p><strong>Methods: </strong>We used retrospective data from 1597 adult recipients of kidney and liver transplants at UC San Diego Health to develop a long short-term memory (LSTM) model to predict next-day tacrolimus C0 in an inpatient setting. Predictors included transplant type, demographics, comorbidities, vital signs, laboratory parameters, ordered diet, and medications. Permutation feature importance was evaluated for the model. We further implemented a classification task to evaluate the model's ability to identify underdosing, therapeutic dosing, and overdosing. Finally, we generated next-day dose recommendations that would achieve tacrolimus C0 within the target ranges.</p><p><strong>Results: </strong>The LSTM model provided a mean absolute error of 1.880 ng/mL when predicting next-day tacrolimus C0. Top predictive features included the recent tacrolimus C0, tacrolimus doses, transplant organ type, diet, and interactive drugs. When predicting underdosing, therapeutic dosing, and overdosing using a 3-class classification task, the model achieved a microaverage F1-score of 0.653. For dose recommendations, the best clinical outcomes were achieved when the actual total daily dose closely aligned with the model's recommended dose (within 3 mg).</p><p><strong>Conclusions: </strong>Ours is one of the largest studies to apply artificial intelligence to tacrolimus dosing, and our LSTM model effectively predicts tacrolimus C0 and could potentially guide accurate dose recommendations. Further prospective studies are needed to evaluate the model's performance in real-world dose adjustments.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e67302"},"PeriodicalIF":2.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404564/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anne Pankow, Nico Meißner-Bendzko, Jessica Kaufeld, Laura Fouquette, Fabienne Cotte, Stephen Gilbert, Ewelina Türk, Anibh Das, Christoph Terkamp, Gerhard-Rüdiger Burmester, Annette Doris Wagner
Background: Rare diseases, which affect millions of people worldwide, pose a major challenge, as it often takes years before an accurate diagnosis can be made. This delay results in substantial burdens for patients and health care systems, as misdiagnoses lead to inadequate treatment and increased costs. Artificial intelligence (AI)-powered symptom checkers (SCs) present an opportunity to flag rare diseases earlier in the diagnostic work-up. However, these tools are primarily based on published literature, which often contains incomplete data on rare diseases, resulting in compromised diagnostic accuracy. Integrating expert interview insights into SC models may enhance their performance, ensuring that rare diseases are considered sooner and diagnosed more accurately.
Objective: The objectives of our study were to incorporate expert interview vignettes into AI-powered SCs, in addition to a traditional literature review, and to evaluate whether this novel approach improves diagnostic accuracy and user satisfaction for rare diseases, focusing on Fabry disease.
Methods: This mixed methods prospective pilot study was conducted at Hannover Medical School, Germany. In the first phase, guided interviews were conducted with medical experts specialized in Fabry disease to create clinical vignettes that enriched the AI SC's Fabry disease model. In the second phase, adult patients with a confirmed diagnosis of Fabry disease used both the original and optimized SC versions in a randomized order. The versions, containing either the original or the optimized Fabry disease model, were evaluated on diagnostic accuracy and user satisfaction, assessed through questionnaires.
Results: Three medical experts with extensive experience in the lysosomal storage disorder Fabry disease contributed to the creation of 5 clinical vignettes, which were integrated into the AI-powered SC. The study compared the original and optimized SC versions in 6 patients with Fabry disease. The optimized version improved diagnostic accuracy, with Fabry disease identified as the top suggestion in 33% (2/6) of cases, compared to 17% (1/6) with the original model. Additionally, overall user satisfaction was higher for the optimized version, with participants rating it more favorably in terms of symptom coverage and completeness.
Conclusions: This study demonstrates that integrating expert-derived clinical vignettes into AI-powered SCs can improve diagnostic accuracy and user satisfaction, particularly for rare diseases. The optimized SC version, which incorporated these vignettes, showed improved performance in identifying Fabry disease as a top diagnostic suggestion and received higher user satisfaction ratings compared to the original version. To fully realize the potential of this approach, it is crucial to include vignettes representing atypical presentations and to …
{"title":"Medical Expert Knowledge Meets AI to Enhance Symptom Checker Performance for Rare Disease Identification in Fabry Disease: Mixed Methods Study.","authors":"Anne Pankow, Nico Meißner-Bendzko, Jessica Kaufeld, Laura Fouquette, Fabienne Cotte, Stephen Gilbert, Ewelina Türk, Anibh Das, Christoph Terkamp, Gerhard-Rüdiger Burmester, Annette Doris Wagner","doi":"10.2196/55001","DOIUrl":"10.2196/55001","url":null,"abstract":"<p><strong>Background: </strong>Rare diseases, which affect millions of people worldwide, pose a major challenge, as it often takes years before an accurate diagnosis can be made. This delay results in substantial burdens for patients and health care systems, as misdiagnoses lead to inadequate treatment and increased costs. Artificial intelligence (AI)-powered symptom checkers (SCs) present an opportunity to flag rare diseases earlier in the diagnostic work-up. However, these tools are primarily based on published literature, which often contains incomplete data on rare diseases, resulting in compromised diagnostic accuracy. Integrating expert interview insights into SC models may enhance their performance, ensuring that rare diseases are considered sooner and diagnosed more accurately.</p><p><strong>Objective: </strong>The objectives of our study were to incorporate expert interview vignettes into AI-powered SCs, in addition to a traditional literature review, and to evaluate whether this novel approach improves diagnostic accuracy and user satisfaction for rare diseases, focusing on Fabry disease.</p><p><strong>Methods: </strong>This mixed methods prospective pilot study was conducted at Hannover Medical School, Germany. In the first phase, guided interviews were conducted with medical experts specialized in Fabry disease to create clinical vignettes that enriched the AI SC's Fabry disease model. In the second phase, adult patients with a confirmed diagnosis of Fabry disease used both the original and optimized SC versions in a randomized order. The versions, containing either the original or the optimized Fabry disease model, were evaluated based on diagnostic accuracy and user satisfaction, which were assessed through questionnaires.</p><p><strong>Results: </strong>Three medical experts with extensive experience in lysosomal storage disorder Fabry disease contributed to the creation of 5 clinical vignettes, which were integrated into the AI-powered SC. The study compared the original and optimized SC versions in 6 patients with Fabry disease. The optimized version improved diagnostic accuracy, with Fabry disease identified as the top suggestion in 33% (2/6) of cases, compared to 17% (1/6) with the original model. Additionally, overall user satisfaction was higher for the optimized version, with participants rating it more favorably in terms of symptom coverage and completeness.</p><p><strong>Conclusions: </strong>This study demonstrates that integrating expert-derived clinical vignettes into AI-powered SCs can improve diagnostic accuracy and user satisfaction, particularly for rare diseases. The optimized SC version, which incorporated these vignettes, showed improved performance in identifying Fabry disease as a top diagnostic suggestion and received higher user satisfaction ratings compared to the original version. 
To fully realize the potential of this approach, it is crucial to include vignettes representing atypical presentations and to ","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e55001"},"PeriodicalIF":2.0,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12392689/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Raphaëlle Giguère, Victor Niaussat, Monia Noël-Hunter, William Witteman, Tanya S Paul, Alexandre Marois, Philippe Després, Simon Duchesne, Patrick M Archambault
Background: Delirium is prevalent in intensive care units (ICUs), often leading to adverse outcomes. Hypoactive delirium is particularly difficult to detect. Despite the development of new tools, timely identification of hypoactive delirium remains clinically challenging due to its dynamic nature, limited human resources, a lack of reliable monitoring tools, and subtle clinical signs, including hypovigilance. Machine learning models could support the identification of hypoactive delirium episodes by better detecting episodes of hypovigilance.
Objective: The aim of this study was to develop an artificial intelligence prediction model capable of detecting hypovigilance events using routinely collected physiological data in the ICU.
Methods: This derivation study was conducted using data from a prospective observational cohort of eligible patients admitted to the ICU in Lévis, Québec, Canada. We included patients admitted to the ICU between October 2021 and June 2022 who were aged ≥18 years and had an anticipated ICU stay of ≥48 hours. ICU nurses identified hypovigilant states every hour using the Richmond Agitation and Sedation Scale (RASS) or the Ramsay Sedation Scale (RSS). Routine vital signs (heart rate, respiratory rate, blood pressure, and oxygen saturation) and other physiological and clinical variables (premature ventricular contractions, intubation, use of sedative medication, and temperature) were automatically collected and stored using a CARESCAPE Gateway (General Electric) or collected manually through chart review (for sociodemographic characteristics and medication). Time series were generated around hypovigilance episodes for analysis. Random forest, XGBoost, and Light Gradient Boosting Machine classifiers were then used to detect hypovigilant episodes based on time series analysis. Hyperparameter optimization was performed using a random search in a 10-fold group-based cross-validation setup. To interpret the predictions of the best-performing models, we conducted a Shapley Additive Explanations (SHAP) analysis. We report the results of this study using the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis for machine learning models) guidelines, and potential biases were assessed using PROBAST (Prediction model Risk Of Bias ASsessment Tool).
Results: Of 136 potentially eligible participants, data from 30 patients (mean age 69 years, 63% male) were collected for analysis. Among all participants, 30% were admitted to the ICU for surgical reasons. Following data preprocessing, the study included 1493 hypovigilance episodes and 764 nonhypovigilant episodes. Among the 3 models evaluated, Light Gradient Boosting Machine demonstrated the best performance, achieving an average accuracy of 68% in detecting hypovigilant episodes, with a precision of 76%, a recall of 74%, an area under the curve (AUC) of 60%, and an F1-score …
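A sketch of the tuning setup named in the Methods follows: a gradient-boosted classifier (LightGBM, the best performer here) tuned by random search inside group-based 10-fold cross-validation, so episodes from one patient never straddle training and validation folds. The synthetic arrays mirror the reported episode and patient counts but are otherwise invented.

```python
# Group-aware random-search sketch on synthetic data; shapes mirror the
# reported counts (2257 episodes, 30 patients), features are invented.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(2257, 20))          # episode-level features (e.g., HR/RR stats)
y = rng.integers(0, 2, size=2257)        # 1 = hypovigilant episode
groups = rng.integers(0, 30, size=2257)  # patient ID per episode

search = RandomizedSearchCV(
    LGBMClassifier(),
    param_distributions={
        "num_leaves": [15, 31, 63],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [100, 300, 500],
    },
    n_iter=10,
    cv=GroupKFold(n_splits=10),  # folds split by patient, not by episode
    scoring="roc_auc",
)
search.fit(X, y, groups=groups)
print(search.best_params_, round(search.best_score_, 3))
```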
{"title":"Predicting Episodes of Hypovigilance in Intensive Care Units Using Routine Physiological Parameters and Artificial Intelligence: Derivation Study.","authors":"Raphaëlle Giguère, Victor Niaussat, Monia Noël-Hunter, William Witteman, Tanya S Paul, Alexandre Marois, Philippe Després, Simon Duchesne, Patrick M Archambault","doi":"10.2196/60885","DOIUrl":"10.2196/60885","url":null,"abstract":"<p><strong>Background: </strong>Delirium is prevalent in intensive care units (ICUs), often leading to adverse outcomes. Hypoactive delirium is particularly difficult to detect. Despite the development of new tools, the timely identification of hypoactive delirium remains clinically challenging due to its dynamic nature, lack of human resources, lack of reliable monitoring tools, and subtle clinical signs including hypovigilance. Machine learning models could support the identification of hypoactive delirium episodes by better detecting episodes of hypovigilance.</p><p><strong>Objective: </strong>Develop an artificial intelligence prediction model capable of detecting hypovigilance events using routinely collected physiological data in the ICU.</p><p><strong>Methods: </strong>This derivation study was conducted using data from a prospective observational cohort of eligible patients admitted to the ICU in Lévis, Québec, Canada. We included patients admitted to the ICU between October 2021 and June 2022 who were aged ≥18 years and had an anticipated ICU stay of ≥48 hours. ICU nurses identified hypovigilant states every hour using the Richmond Agitation and Sedation Scale (RASS) or the Ramsay Sedation Scale (RSS). Routine vital signs (heart rate, respiratory rate, blood pressure, and oxygen saturation), as well as other physiological and clinical variables (premature ventricular contractions, intubation, use of sedative medication, and temperature), were automatically collected and stored using a CARESCAPE Gateway (General Electric) or manually collected (for sociodemographic characteristics and medication) through chart review. Time series were generated around hypovigilance episodes for analysis. Random Forest, XGBoost, and Light Gradient Boosting Machine classifiers were then used to detect hypovigilant episodes based on time series analysis. Hyperparameter optimization was performed using a random search in a 10-fold group-based cross-validation setup. To interpret the predictions of the best-performing models, we conducted a Shapley Additive Explanations (SHAP) analysis. We report the results of this study using the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis for machine learning models) guidelines, and potential biases were assessed using PROBAST (Prediction model Risk Of Bias ASsessment Tool).</p><p><strong>Results: </strong>Out of 136 potentially eligible participants, data from 30 patients (mean age 69 y, 63% male) were collected for analysis. Among all participants, 30% were admitted to the ICU for surgical reasons. Following data preprocessing, the study included 1493 hypovigilance episodes and 764 nonhypovigilant episodes. Among the 3 models evaluated, Light Gradient Boosting Machine demonstrated the best performance. 
It achieved an average accuracy of 68% to detect hypovigilant episodes, with a precision of 76%, a recall of 74%, an area under the curve (AUC) of 60%, and an F","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e60885"},"PeriodicalIF":2.0,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384691/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ethan Bernstein, Anya Ramsamooj, Kelsey L Millar, Zachary C Lum
Background: Since the release of ChatGPT and other large language models (LLMs), there has been a significant increase in academic publications exploring their capabilities and implications across various fields, such as medicine, education, and technology.
Objective: This study aims to identify the most influential academic works on LLMs published in the past year and to categorize their research types and thematic focuses across different professional fields. The study also evaluates the ability of artificial intelligence (AI) tools, such as ChatGPT, to accurately classify academic research.
Methods: We conducted a bibliometric analysis using Clarivate's Web of Science (WOS) to extract the top 100 most cited papers on LLMs. Papers were manually categorized by field, journal, author, and research type. ChatGPT-4 was used to generate categorizations for the same papers, and its performance was compared to human classifications. We summarized the distribution of research fields and assessed the concordance between AI-generated and manual classifications.
Results: Medicine emerged as the predominant field among the top 100 most cited papers, accounting for 43 (43%), followed by education with 26 (26%) and technology with 15 (15%). Medical literature primarily focused on clinical applications of LLMs, limitations of AI in health care, and the role of AI in medical education. In education, research centered on ethical concerns and potential applications of AI for teaching and learning. ChatGPT demonstrated variable concordance with human reviewers, achieving an agreement rate of 47% for research types and 92% for fields of study.
Conclusions: While LLMs such as ChatGPT exhibit considerable potential in aiding research categorization, human oversight remains essential to address issues such as hallucinations, outdated information, and biases in AI-generated outputs. This study highlights the transformative potential of LLMs across multiple sectors and emphasizes the importance of continuous ethical evaluation and iterative improvement of AI systems to maximize their benefits while minimizing risks.
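The agreement figures above can be reproduced from paired label lists; the sketch below computes simple percent agreement and, as an optional chance-corrected companion not reported by the study, Cohen kappa. The labels are invented for illustration.

```python
# Human-vs-model concordance sketch on invented labels.
from sklearn.metrics import cohen_kappa_score

human = ["medicine", "education", "technology", "medicine", "medicine"]
model = ["medicine", "education", "medicine", "medicine", "medicine"]

# Raw percent agreement, as reported in the study.
agreement = sum(h == m for h, m in zip(human, model)) / len(human)
print(f"percent agreement: {agreement:.0%}")

# Chance-corrected agreement (our addition, not in the abstract).
print(f"Cohen kappa: {cohen_kappa_score(human, model):.2f}")
```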
{"title":"Identification and Categorization of the Top 100 Articles and the Future of Large Language Models: Thematic Analysis Using Bibliometric Analysis.","authors":"Ethan Bernstein, Anya Ramsamooj, Kelsey L Millar, Zachary C Lum","doi":"10.2196/68603","DOIUrl":"10.2196/68603","url":null,"abstract":"<p><strong>Background: </strong>Since the release of ChatGPT and other large language models (LLMs), there has been a significant increase in academic publications exploring their capabilities and implications across various fields, such as medicine, education, and technology.</p><p><strong>Objective: </strong>This study aims to identify the most influential academic works on LLMs published in the past year, categorize their research types and thematic focuses, within different professional fields. The study also evaluates the ability of artificial intelligence (AI) tools, such as ChatGPT, to accurately classify academic research.</p><p><strong>Methods: </strong>We conducted a bibliometric analysis using Clarivate's Web of Science (WOS) to extract the top 100 most cited papers on LLMs. Papers were manually categorized by field, journal, author, and research type. ChatGPT-4 was used to generate categorizations for the same papers, and its performance was compared to human classifications. We summarized the distribution of research fields and assessed the concordance between AI-generated and manual classifications.</p><p><strong>Results: </strong>Medicine emerged as the predominant field among the top 100 most cited papers, accounting for 43 (43%), followed by education 26 (26%) and technology 15 (15%). Medical literature primarily focused on clinical applications of LLMs, limitations of AI in health care, and the role of AI in medical education. In education, research was centered around ethical concerns and potential applications of AI for teaching and learning. ChatGPT demonstrated variable concordance with human reviewers, achieving an agreement rating of 47% for research types and 92% for fields of study.</p><p><strong>Conclusions: </strong>While LLMs such as ChatGPT exhibit considerable potential in aiding research categorization, human oversight remains essential to address issues such as hallucinations, outdated information, and biases in AI-generated outputs. This study highlights the transformative potential of LLMs across multiple sectors and emphasizes the importance of continuous ethical evaluation and iterative improvement of AI systems to maximize their benefits while minimizing risks.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68603"},"PeriodicalIF":2.0,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384689/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Masab Mansoor, Andrew Ibrahim, Ali Hamide
Background: Limited research exists evaluating artificial intelligence (AI) performance on standardized pediatric assessments. This study evaluated 3 leading AI models on pediatric board preparation questions.
Objective: The aim of this study is to evaluate and compare the performance of 3 leading large language models (LLMs) on pediatric board examination preparation questions and contextualize their performance against human physician benchmarks.
Methods: We analyzed DeepSeek-R1, ChatGPT-4, and ChatGPT-4.5 using 266 multiple-choice questions from the 2023 PREP Self-Assessment. Performance was compared to published American Board of Pediatrics first-time pass rates.
Results: DeepSeek-R1 exhibited the highest accuracy at 98.1% (261/266 correct responses). ChatGPT-4.5 achieved 96.6% accuracy (257/266), performing at the upper threshold of human performance. ChatGPT-4 demonstrated 82.7% accuracy (220/266), comparable to the lower range of human pass rates. Error pattern analysis revealed that AI models most commonly struggled with questions requiring integration of complex clinical presentations with rare disease knowledge.
Conclusions: DeepSeek-R1 demonstrated exceptional performance exceeding typical American Board of Pediatrics pass rates, suggesting potential applications in medical education and clinical support, though further research on complex clinical reasoning is needed.
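The accuracies above come with straightforward binomial uncertainty; as a hedged aside (the abstract itself reports no CIs), the sketch below computes Wilson 95% CIs from the stated correct-answer counts.

```python
# Proportions and Wilson 95% CIs from the reported counts (out of 266).
from statsmodels.stats.proportion import proportion_confint

models = {"DeepSeek-R1": 261, "ChatGPT-4.5": 257, "ChatGPT-4": 220}
n = 266
for name, correct in models.items():
    lo, hi = proportion_confint(correct, n, alpha=0.05, method="wilson")
    print(f"{name}: {correct / n:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```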
{"title":"Performance of DeepSeek and GPT Models on Pediatric Board Preparation Questions: Comparative Evaluation.","authors":"Masab Mansoor, Andrew Ibrahim, Ali Hamide","doi":"10.2196/76056","DOIUrl":"10.2196/76056","url":null,"abstract":"<p><strong>Background: </strong>Limited research exists evaluating artificial intelligence (AI) performance on standardized pediatric assessments. This study evaluated 3 leading AI models on pediatric board preparation questions.</p><p><strong>Objective: </strong>The aim of this study is to evaluate and compare the performance of 3 leading large language models (LLMs) on pediatric board examination preparation questions and contextualize their performance against human physician benchmarks.</p><p><strong>Methods: </strong>We analyzed DeepSeek-R1, ChatGPT-4, and ChatGPT-4.5 using 266 multiple-choice questions from the 2023 PREP Self-Assessment. Performance was compared to published American Board of Pediatrics first-time pass rates.</p><p><strong>Results: </strong>DeepSeek-R1 exhibited the highest accuracy at 98.1% (261/266 correct responses). ChatGPT-4.5 achieved 96.6% accuracy (257/266), performing at the upper threshold of human performance. ChatGPT-4 demonstrated 82.7% accuracy (220/266), comparable to the lower range of human pass rates. Error pattern analysis revealed that AI models most commonly struggled with questions requiring integration of complex clinical presentations with rare disease knowledge.</p><p><strong>Conclusions: </strong>DeepSeek-R1 demonstrated exceptional performance exceeding typical American Board of Pediatrics pass rates, suggesting potential applications in medical education and clinical support, though further research on complex clinical reasoning is needed.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e76056"},"PeriodicalIF":2.0,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384676/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shaodong Wang, Yiqun Jiang, Qing Li, Wenli Zhang
Background: Intensive care units (ICUs) treat patients with life-threatening illnesses. Worldwide, intensive care demand is massive. Predicting patient outcomes in ICUs holds significant importance for health care operation management. Nevertheless, it remains a challenging problem that researchers and health care practitioners have yet to overcome. While the newly emerging health digital trace data offer new possibilities, such data contain complex time series and patterns. Although researchers have devised severity score systems, traditional machine learning models with feature engineering, and deep learning models that use raw clinical data to predict ICU outcomes, existing methods have limitations.
Objective: This study aimed to develop a novel feature extraction and machine learning framework to repurpose and extract features with strong predictive power from patients' health digital traces for ICU outcome prediction.
Methods: Guided by signal processing techniques and medical domain knowledge, the proposed framework introduces a novel signal processing-based feature engineering method that extracts highly predictive features from ICU digital trace data. The method was rigorously evaluated on a real-world ICU database to assess prediction accuracy and feature representativeness against both traditional and deep learning baseline methods.
Results: The prediction results obtained by the proposed framework significantly outperformed state-of-the-art benchmarks. This demonstrated the framework's effectiveness in capturing key patterns from complex health digital traces for improving ICU outcome prediction.
Conclusions: Our study contributes to health care operation management by leveraging digital traces from health care information systems to address challenges with significant implications for health care.
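The title names the two ingredients: signal processing-based feature extraction and ν-support vector classification. The sketch below pairs Welch power-spectral-density summaries (our stand-in for the paper's stochastic signal processing features) with scikit-learn's NuSVC on synthetic vital-sign traces; all shapes and features are illustrative.

```python
# Two-stage sketch: spectral features from a vital-sign trace, then nu-SVC.
import numpy as np
from scipy.signal import welch
from sklearn.svm import NuSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
traces = rng.normal(70, 10, size=(200, 256))  # 200 stays x 256 HR samples
labels = rng.integers(0, 2, size=200)         # ICU outcome, synthetic

def spectral_features(x):
    # Welch power spectral density as a simple signal-processing summary.
    freqs, power = welch(x, fs=1.0, nperseg=64)
    return [power.sum(), freqs[np.argmax(power)], power.max() / power.mean()]

X = np.array([spectral_features(t) for t in traces])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0)
clf = NuSVC(nu=0.5).fit(X_tr, y_tr)
print(f"holdout accuracy: {clf.score(X_te, y_te):.2f}")
```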
{"title":"Intensive Care Unit Patient Outcome Prediction Using ν-Support Vector Classification and Stochastic Signal Processing-Based Feature Extraction Techniques: Algorithm Development and Validation Study.","authors":"Shaodong Wang, Yiqun Jiang, Qing Li, Wenli Zhang","doi":"10.2196/72671","DOIUrl":"10.2196/72671","url":null,"abstract":"<p><strong>Background: </strong>Intensive care units (ICUs) treat patients with life-threatening illnesses. Worldwide, intensive care demand is massive. Predicting patient outcomes in ICUs holds significant importance for health care operation management. Nevertheless, it remains a challenging problem that researchers and health care practitioners have yet to overcome. While the newly emerging health digital trace data offer new possibilities, such data contain complex time series and patterns. Although researchers have devised severity score systems, traditional machine learning models with feature engineering, and deep learning models that use raw clinical data to predict ICU outcomes, existing methods have limitations.</p><p><strong>Objective: </strong>This study aimed to develop a novel feature extraction and machine learning framework to repurpose and extract features with strong predictive power from patients' health digital traces for ICU outcome prediction.</p><p><strong>Methods: </strong>Guided by signal processing techniques and medical domain knowledge, the proposed framework introduces a novel, signal processing-based feature engineering method to extract highly predictive features from ICU digital trace data. We rigorously evaluated this method on a real-world ICU dataset, demonstrating significant improvements over both traditional and deep learning baseline methods. The method was then evaluated using a real-world database to assess prediction accuracy and feature representativeness.</p><p><strong>Results: </strong>The prediction results obtained by the proposed framework significantly outperformed state-of-the-art benchmarks. This demonstrated the framework's effectiveness in capturing key patterns from complex health digital traces for improving ICU outcome prediction.</p><p><strong>Conclusions: </strong>Our study contributes to health care operation management by leveraging digital traces from health care information systems to address challenges with significant implications for health care.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e72671"},"PeriodicalIF":2.0,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12421204/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phuong Dinh Ngo, Miguel Ángel Tejedor Hernández, Taridzo Chomutare, Andrius Budrionis, Therese Olsen Svenning, Torbjørn Torsvik, Anastasios Lamproudis, Hercules Dalianis
Background: Accurately assigning ICD-10 (International Statistical Classification of Diseases, Tenth Revision) codes is critical for clinical documentation, reimbursement processes, epidemiological studies, and health care planning. Manual coding is time-consuming, labor-intensive, and prone to errors, underscoring the need for automated solutions within the Norwegian health care system. Recent advances in natural language processing (NLP) and transformer-based language models have shown promising results in automating ICD (International Classification of Diseases) coding in several languages. However, prior work has focused primarily on English and other high-resource languages, leaving a gap in Norwegian-specific clinical NLP research.
Objective: This study introduces 2 versions of NorDeClin-BERT (NorDeClin Bidirectional Encoder Representations from Transformers), domain-specific Norwegian BERT-based models pretrained on a large corpus of Norwegian clinical text to enhance their understanding of medical language. Both models were subsequently fine-tuned to predict ICD-10 diagnosis codes. We aimed to evaluate the impact of domain-specific pretraining and model size on classification performance and to compare NorDeClin-BERT with general-purpose and cross-lingual BERT models in the context of Norwegian ICD-10 coding.
Methods: Two versions of NorDeClin-BERT were pretrained on the ClinCode Gastro Corpus, a large-scale dataset comprising 8.8 million deidentified Norwegian clinical notes, to enhance domain-specific language modeling. The base model builds upon NorBERT3-base and was pretrained on a large, relevant subset of the corpus, while the large model builds upon NorBERT3-large and was trained on the full dataset. Both models were benchmarked against SweDeClin-BERT, ScandiBERT, NorBERT3-base, and NorBERT3-large, using standard evaluation metrics: accuracy, precision, recall, and F1-score.
Results: Both versions of NorDeClin-BERT outperformed general-purpose Norwegian BERT models and Swedish clinical BERT models in classifying both prevalent and less common ICD-10 codes. Notably, NorDeClin-BERT-large achieved the highest overall performance across evaluation metrics, demonstrating the impact of domain-specific clinical pretraining in Norwegian. These results highlight that domain-specific pretraining on Norwegian clinical text, combined with model capacity, improves ICD-10 classification accuracy compared with general-domain Norwegian models and Swedish models pretrained on clinical text. Furthermore, while Swedish clinical models demonstrated some transferability to Norwegian, their performance remained suboptimal, emphasizing the necessity of Norwegian-specific clinical pretraining.
Conclusions: This study highlights the potential of NorDeClin-BERT to improve ICD-10 code classification for the gastroenterology domain …
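As a hedged illustration of the fine-tuning stage, the sketch below attaches a sequence-classification head over ICD-10 codes to a public Norwegian BERT. NbAiLab/nb-bert-base is a stand-in encoder (NorDeClin-BERT's clinically pretrained weights are not assumed available here), and the label count and example sentence are invented.

```python
# Fine-tuning sketch with a public stand-in Norwegian encoder; the paper
# instead continues pretraining NorBERT3-base/-large on clinical text first.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "NbAiLab/nb-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=50  # one logit per candidate ICD-10 code (invented count)
)

batch = tokenizer(
    ["Pasienten har magesmerter og kvalme."],  # "abdominal pain and nausea"
    return_tensors="pt", truncation=True, padding=True,
)
labels = torch.tensor([3])           # index of the gold ICD-10 code
out = model(**batch, labels=labels)
out.loss.backward()                  # one fine-tuning step (optimizer omitted)
print(out.logits.shape)              # torch.Size([1, 50])
```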
{"title":"Domain-Specific Pretraining of NorDeClin-Bidirectional Encoder Representations From Transformers for <i>International Statistical Classification of Diseases, Tenth Revision,</i> Code Prediction in Norwegian Clinical Texts: Model Development and Evaluation Study.","authors":"Phuong Dinh Ngo, Miguel Ángel Tejedor Hernández, Taridzo Chomutare, Andrius Budrionis, Therese Olsen Svenning, Torbjørn Torsvik, Anastasios Lamproudis, Hercules Dalianis","doi":"10.2196/66153","DOIUrl":"10.2196/66153","url":null,"abstract":"<p><strong>Background: </strong>Accurately assigning ICD-10 (International Statistical Classification of Diseases, Tenth Revision) codes is critical for clinical documentation, reimbursement processes, epidemiological studies, and health care planning. Manual coding is time-consuming, labor-intensive, and prone to errors, underscoring the need for automated solutions within the Norwegian health care system. Recent advances in natural language processing (NLP) and transformer-based language models have shown promising results in automating ICD (International Classification of Diseases) coding in several languages. However, prior work has focused primarily on English and other high-resource languages, leaving a gap in Norwegian-specific clinical NLP research.</p><p><strong>Objective: </strong>This study introduces 2 versions of NorDeClin-BERT (NorDeClin Bidirectional Encoder Representations from Transformers), domain-specific Norwegian BERT-based models pretrained on a large corpus of Norwegian clinical text to enhance their understanding of medical language. Both models were subsequently fine-tuned to predict ICD-10 diagnosis codes. We aimed to evaluate the impact of domain-specific pretraining and model size on classification performance and to compare NorDeClin-BERT with general-purpose and cross-lingual BERT models in the context of Norwegian ICD-10 coding.</p><p><strong>Methods: </strong>Two versions of NorDeClin-BERT were pretrained on the ClinCode Gastro Corpus, a large-scale dataset comprising 8.8 million deidentified Norwegian clinical notes, to enhance domain-specific language modeling. The base model builds upon NorBERT3-base and was pretrained on a large, relevant subset of the corpus, while the large model builds upon NorBERT3-large and was trained on the full dataset. Both models were benchmarked against SweDeClin-BERT, ScandiBERT, NorBERT3-base, and NorBERT3-large, using standard evaluation metrics: accuracy, precision, recall, and F1-score.</p><p><strong>Results: </strong>The results show that both versions of NorDeClin-BERT outperformed general-purpose Norwegian BERT models and Swedish clinical BERT models in classifying both prevalent and less common ICD-10 codes. Notably, NorDeClin-BERT-large achieved the highest overall performance across evaluation metrics, demonstrating the impact of domain-specific clinical pretraining in Norwegian. These results highlight that domain-specific pretraining on Norwegian clinical text, combined with model capacity, improves ICD-10 classification accuracy compared with general-domain Norwegian models and Swedish models pretrained on clinical text. 
Furthermore, while Swedish clinical models demonstrated some transferability to Norwegian, their performance remained suboptimal, emphasizing the necessity of Norwegian-specific clinical pretraining.</p><p><strong>Conclusions: </strong>This study highlights the potential of NorDeClin-BERT to improve ICD-10 code classification for the gastroenterology do","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e66153"},"PeriodicalIF":2.0,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12377785/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haya Engelstein, Roni Ramon-Gonen, Avi Sabbag, Eyal Klang, Karin Sudri, Michal Cohen-Shelly, Israel Barbash
Background: Recent progress has demonstrated the potential of deep learning models in analyzing electrocardiogram (ECG) pathologies. However, such models are intricate, expensive to develop, and designed for specific purposes. Large language models show promise in medical image interpretation, yet their effectiveness in ECG analysis remains understudied. Generative Pretrained Transformer 4 Omni (GPT-4o), a multimodal artificial intelligence model capable of processing images and text without task-specific training, may offer an accessible alternative.
Objective: This study aimed to evaluate GPT-4o's effectiveness in interpreting 12-lead ECGs, assessing classification accuracy, and exploring methods to enhance its performance.
Methods: A total of 6 common ECG diagnoses were evaluated: normal ECG, ST-segment elevation myocardial infarction, atrial fibrillation, right bundle branch block, left bundle branch block, and paced rhythm, with 30 normal ECGs and 10 of each abnormal pattern, totaling 80 cases. Deidentified ECGs were analyzed using OpenAI's GPT-4o. Our study used both zero-shot and few-shot learning methodologies to investigate three main scenarios: (1) ECG image recognition, (2) binary classification of normal versus abnormal ECGs, and (3) multiclass classification into 6 categories.
Results: The model excelled in recognizing ECG images, achieving an accuracy of 100%. In the classification of normal or abnormal ECG cases, the few-shot learning approach improved GPT-4o's accuracy by 30% from the baseline, reaching 83% (95% CI 81.8%-84.6%). However, multiclass classification for a specific pathology remained limited, achieving only 41% accuracy.
Conclusions: GPT-4o effectively differentiates normal from abnormal ECGs, suggesting its potential as an accessible artificial intelligence-assisted triage tool. Although limited in diagnosing specific cardiac conditions, GPT-4o's capability to interpret ECG images without specialized training highlights its potential for preliminary ECG interpretation in clinical and remote settings.
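A minimal sketch of the kind of zero-shot call evaluated above, using the OpenAI Python SDK's image input; the prompt wording and file name are illustrative, and a few-shot variant would prepend labeled example images to the message list.

```python
# Zero-shot ECG-image classification sketch via the OpenAI SDK.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("ecg_12lead.png", "rb") as f:  # illustrative file name
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Is this 12-lead ECG normal or abnormal? Answer with one word."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```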
{"title":"Effectiveness of the GPT-4o Model in Interpreting Electrocardiogram Images for Cardiac Diagnostics: Diagnostic Accuracy Study.","authors":"Haya Engelstein, Roni Ramon-Gonen, Avi Sabbag, Eyal Klang, Karin Sudri, Michal Cohen-Shelly, Israel Barbash","doi":"10.2196/74426","DOIUrl":"10.2196/74426","url":null,"abstract":"<p><strong>Background: </strong>Recent progress has demonstrated the potential of deep learning models in analyzing electrocardiogram (ECG) pathologies. However, this method is intricate, expensive to develop, and designed for specific purposes. Large language models show promise in medical image interpretation, and yet their effectiveness in ECG analysis remains understudied. Generative Pretrained Transformer 4 Omni (GPT-4o), a multimodal artificial intelligence model, capable of processing images and text without task-specific training, may offer an accessible alternative.</p><p><strong>Objective: </strong>This study aimed to evaluate GPT-4o's effectiveness in interpreting 12-lead ECGs, assessing classification accuracy, and exploring methods to enhance its performance.</p><p><strong>Methods: </strong>A total of 6 common ECG diagnoses were evaluated: normal ECG, ST-segment elevation myocardial infarction, atrial fibrillation, right bundle branch block, left bundle branch block, and paced rhythm, with 30 normal ECGs and 10 of each abnormal pattern, totaling 80 cases. Deidentified ECGs were analyzed using OpenAI's GPT-4o. Our study used both zero-shot and few-shot learning methodologies to investigate three main scenarios: (1) ECG image recognition, (2) binary classification of normal versus abnormal ECGs, and (3) multiclass classification into 6 categories.</p><p><strong>Results: </strong>The model excelled in recognizing ECG images, achieving an accuracy of 100%. In the classification of normal or abnormal ECG cases, the few-shot learning approach improved GPT-4o's accuracy by 30% from the baseline, reaching 83% (95% CI 81.8%-84.6%). However, multiclass classification for a specific pathology remained limited, achieving only 41% accuracy.</p><p><strong>Conclusions: </strong>GPT-4o effectively differentiates normal from abnormal ECGs, suggesting its potential as an accessible artificial intelligence-assisted triage tool. Although limited in diagnosing specific cardiac conditions, GPT-4o's capability to interpret ECG images without specialized training highlights its potential for preliminary ECG interpretation in clinical and remote settings.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e74426"},"PeriodicalIF":2.0,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12375907/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}