JMIR Medical Informatics: Latest Articles (Page 2)

Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2025-01-10 DOI: 10.2196/63731

Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan, Fang Zhang

Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored.

Objective: This study aims to evaluate the accuracy of 7 LLMs (GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5) on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy.

Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques.
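The abstract does not say how the 7 LLM outputs were encoded before being passed to the 9 machine learning models. As a rough illustration of the general idea of combining model answers, here is a minimal sketch of a plurality vote, which is a much simpler stand-in for the learned ensembles (such as XGBoost) named above and is not the authors' method. The function names and the sample answers are hypothetical.

```python
from collections import Counter


def plurality_vote(answers):
    """Return the most common option letter; ties go to the option seen first."""
    counts = Counter(answers)
    best = max(counts.values())
    for option in answers:  # iterate in input order so ties break deterministically
        if counts[option] == best:
            return option


def ensemble_accuracy(predictions_per_question, answer_key):
    """Fraction of questions where the plurality vote matches the answer key."""
    correct = sum(
        plurality_vote(preds) == key
        for preds, key in zip(predictions_per_question, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical data: 3 questions, 7 model answers each (not taken from the study).
sample_predictions = [
    ["A", "A", "B", "A", "C", "A", "A"],
    ["D", "B", "B", "B", "D", "B", "E"],
    ["C", "E", "E", "C", "A", "B", "D"],  # C and E tie; C was seen first
]
sample_key = ["A", "B", "E"]

print(ensemble_accuracy(sample_predictions, sample_key))
```

A learned ensemble differs from this vote in that it can weight models unequally, for example by trusting one model more on a given question type.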

Results: Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F₁-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977.
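The sensitivity, specificity, F1-score, and predictive values reported for XGBoost follow the usual confusion-matrix definitions. The sketch below shows those definitions for a single binary (one-vs-rest) confusion matrix; the abstract does not state how the values were averaged across answer options, and the counts used here are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics derived from one binary confusion matrix."""
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # positive predictive value (precision)
    npv = tn / (tn + fn)                  # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts, not taken from the study.
example = binary_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```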

Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE, and it shows that integrating the models via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training.


Volume 13, e63731. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11759905/pdf/

Citations: 0

Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2025-01-09 DOI: 10.2196/63924

Yong Zhang, Xiao Lu, Yan Luo, Ying Zhu, Wenwu Ling

Background: Artificial intelligence chatbots are being increasingly used for medical inquiries, particularly in the field of ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic.

Objective: This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers.

Methods: We curated 554 questions from ultrasound medicine examinations, covering various question types and topics. The questions were posed in both English and Chinese. Objective questions were scored based on accuracy rates, whereas subjective questions were rated by 5 experienced doctors using a Likert scale. The data were analyzed in Excel.

Results: Of the 554 questions included in this study, single-choice questions comprised the largest share (354/554, 64%), followed by short answers (69/554, 12%) and noun explanations (63/554, 11%). The accuracy rates for objective questions ranged from 8.33% to 80%, with true or false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot was superior to ChatGPT in many aspects (P<.05). Both models showed a performance decline in English, but ERNIE Bot's decline was less significant. The models performed better in terms of basic knowledge, ultrasound methods, and diseases than in terms of ultrasound signs and diagnosis.
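The percentage shares quoted above can be checked directly from the counts. The short sketch below redoes that arithmetic; the dictionary keys are just labels for the three question types the abstract breaks out, and the variable names are illustrative.

```python
TOTAL_QUESTIONS = 554

# Counts for the question types reported in the abstract.
counts = {
    "single-choice": 354,
    "short answer": 69,
    "noun explanation": 63,
}


def share_percent(count, total):
    """Share of the total as a whole-number percentage."""
    return round(100 * count / total)


shares = {name: share_percent(n, TOTAL_QUESTIONS) for name, n in counts.items()}
remaining = TOTAL_QUESTIONS - sum(counts.values())

print(shares)     # {'single-choice': 64, 'short answer': 12, 'noun explanation': 11}
print(remaining)  # 68 questions of types not broken out in the abstract
```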

Conclusions: Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use.


Volume 13, e63924. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11737282/pdf/

Citations: 0

Patients' Experienced Usability and Satisfaction With Digital Health Solutions in a Home Setting: Instrument Validation Study.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2025-01-08 DOI: 10.2196/63703

Susan J Oudbier, Ellen Ma Smets, Pythia T Nieuwkerk, David P Neal, S Azam Nurmohamed, Hans J Meij, Linda W Dusseljee-Peute

Background: The field of digital health solutions (DHS) has grown tremendously over the past years. DHS include tools for self-management, which support individuals to take charge of their own health. The usability of DHS, as experienced by patients, is pivotal to adoption. However, well-known questionnaires that evaluate usability and satisfaction use complex terminology derived from human-computer interaction and are therefore not well suited to assess experienced usability of patients using DHS in a home setting.

Objective: This study aimed to develop, validate, and assess an instrument that measures experienced usability and satisfaction of patients using DHS in a home setting.

Methods: The development of the "Experienced Usability and Satisfaction with Self-monitoring in the Home Setting" (GEMS) questionnaire followed several steps. Step I consisted of assessing the content validity, by conducting a literature review on current usability and satisfaction questionnaires, collecting statements and discussing these in an expert meeting, and translating each statement and adjusting it to the language level of the general population. This phase resulted in a draft version of the GEMS. Step II comprised assessing its face validity by pilot testing with Amsterdam University Medical Center's patient panel. In step III, psychometric analysis was conducted and the GEMS was assessed for reliability.
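The abstract does not name the reliability statistic used in step III. A common choice for questionnaire scales is Cronbach's alpha, so the sketch below assumes that statistic purely for illustration; the function name and the response data are hypothetical and unrelated to the GEMS dataset.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a scale.

    responses: list of respondents, each a list of k item scores.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(responses[0])
    item_columns = list(zip(*responses))
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical 3-item scale answered by 5 respondents.
sample_responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]

alpha = cronbach_alpha(sample_responses)
print(f"alpha = {alpha:.3f}")
```

Values closer to 1 indicate that the items of a scale vary together, which is what a "reliable scale" usually means in this context.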

Results: A total of 14 items were included for psychometric analysis and resulted in 4 reliable scales: convenience of use, perceived value, efficiency of use, and satisfaction.

Conclusions: Overall, the GEMS questionnaire demonstrated its reliability and validity in assessing experienced usability and satisfaction of DHS in a home setting. Further refinement of the instrument is necessary to confirm its applicability in other patient populations in order to promote the development of a steering mechanism that can be applied longitudinally throughout implementation, and can be used as a benchmarking instrument.


Volume 13, e63703. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734564/pdf/

Citations: 0

Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2025-01-06 DOI: 10.2196/63020

Yan Zhuang, Junyan Zhang, Xiuxing Li, Chao Liu, Yue Yu, Wei Dong, Kunlun He

Background: Machine learning models can reduce the burden on doctors by converting medical records into International Classification of Diseases (ICD) codes in real time, thereby enhancing the efficiency of diagnosis and treatment. However, this task faces challenges such as small datasets, diverse writing styles, unstructured records, and the need for semimanual preprocessing. Existing approaches, such as naive Bayes, Word2Vec, and convolutional neural networks, have limitations in handling missing values and understanding the context of medical texts, leading to a high error rate. We developed a fully automated pipeline based on the Key-bidirectional encoder representations from transformers (BERT) approach and large-scale medical records for continued pretraining, which effectively converts long free text into standard ICD codes. By adjusting parameter settings, such as mixed templates and soft verbalizers, the model can adapt flexibly to different requirements, enabling task-specific prompt learning.

Objective: This study aims to propose a prompt learning real-time framework based on pretrained language models that can automatically label long free-text data with ICD-10 codes for cardiovascular diseases without the need for semiautomatic preprocessing.

Methods: We integrated 4 components into our framework: a medical pretrained BERT, a keyword filtration BERT in a functional order, a fine-tuning phase, and task-specific prompt learning utilizing mixed templates and soft verbalizers. This framework was validated on a multicenter medical dataset for the automated ICD coding of 13 common cardiovascular diseases (584,969 records). Its performance was compared against the robustly optimized BERT pretraining approach, extreme language network, and various BERT-based fine-tuning pipelines. Additionally, we evaluated the framework's performance under different prompt learning and fine-tuning settings. Furthermore, few-shot learning experiments were conducted to assess the feasibility and efficacy of our framework in scenarios involving small- to mid-sized datasets.

Results: Compared with traditional pretraining and fine-tuning pipelines, our approach achieved a higher micro-F1-score of 0.838 and a macro-area under the receiver operating characteristic curve (macro-AUC) of 0.958, which is 10% higher than other methods. Among different prompt learning setups, the combination of mixed templates and soft verbalizers yielded the best performance. Few-shot experiments showed that performance stabilized and the AUC peaked at 500 shots.

Conclusions: These findings underscore the effectiveness and superior performance of prompt learning and fine-tuning for subtasks within pretrained language models in medical practice. Our real-time ICD coding pipeline efficiently converts detailed medical free text into standardized labels, offering promising applications in
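The results above are reported as a micro-averaged F1-score. For readers unfamiliar with the distinction, the sketch below contrasts micro and macro averaging over per-label counts: micro averaging pools the counts before computing F1, so frequent labels dominate, whereas macro averaging gives each label equal weight. The two ICD-10 codes and all counts are hypothetical and are not taken from the study's 13 diseases or its data.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(per_label_counts):
    """per_label_counts maps label -> (tp, fp, fn). Returns (micro_f1, macro_f1)."""
    tp_sum = sum(c[0] for c in per_label_counts.values())
    fp_sum = sum(c[1] for c in per_label_counts.values())
    fn_sum = sum(c[2] for c in per_label_counts.values())
    micro = f1_from_counts(tp_sum, fp_sum, fn_sum)
    per_label = [f1_from_counts(*c) for c in per_label_counts.values()]
    macro = sum(per_label) / len(per_label)
    return micro, macro


# Hypothetical counts: one frequent, well-predicted code and one rare, poorly predicted code.
sample_counts = {
    "I10": (80, 10, 10),
    "I48": (5, 5, 15),
}

micro_f1, macro_f1 = micro_macro_f1(sample_counts)
print(f"micro-F1 = {micro_f1:.3f}, macro-F1 = {macro_f1:.3f}")
```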

Volume 13, e63020. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11747532/pdf/

Citations: 0

Development and Evaluation of a Mental Health Chatbot Using ChatGPT 4.0: Mixed Methods User Experience Study With Korean Users.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2025-01-03 DOI: 10.2196/63538

Boyoung Kang, Munpyo Hong

Background: Mental health chatbots have emerged as a promising tool for providing accessible and convenient support to individuals in need. Building on our previous research on digital interventions for loneliness and depression among Korean college students, this study addresses the limitations identified and explores more advanced artificial intelligence-driven solutions.

Objective: This study aimed to develop and evaluate the performance of HoMemeTown Dr. CareSam, an advanced cross-lingual chatbot using ChatGPT 4.0 (OpenAI) to provide seamless support in both English and Korean contexts. The chatbot was designed to address the need for more personalized and culturally sensitive mental health support identified in our previous work while providing an accessible and user-friendly interface for Korean young adults.

Methods: We conducted a mixed methods pilot study with 20 Korean young adults aged 18 to 27 (mean 23.3, SD 1.96) years. The HoMemeTown Dr CareSam chatbot was developed using the GPT application programming interface, incorporating features such as a gratitude journal and risk detection. User satisfaction and chatbot performance were evaluated using quantitative surveys and qualitative feedback, with triangulation used to ensure the validity and robustness of findings through cross-verification of data sources. Comparative analyses were conducted with other large language model chatbots and existing digital therapy tools (Woebot [Woebot Health Inc] and Happify [Twill Inc]).

Results: Users generally expressed positive views towards the chatbot, with positivity and support receiving the highest score on a 10-point scale (mean 9.0, SD 1.2), followed by empathy (mean 8.7, SD 1.6) and active listening (mean 8.0, SD 1.8). However, areas for improvement were noted in professionalism (mean 7.0, SD 2.0), complexity of content (mean 7.4, SD 2.0), and personalization (mean 7.4, SD 2.4). The chatbot demonstrated statistically significant performance differences compared with other large language model chatbots (F=3.27; P=.047), with more pronounced differences compared with Woebot and Happify (F=12.94; P<.001). Qualitative feedback highlighted the chatbot's strengths in providing empathetic responses and a user-friendly interface, while areas for improvement included response speed and the naturalness of Korean language responses.

Conclusions: The HoMemeTown Dr CareSam chatbot shows potential as a cross-lingual mental health support tool, achieving high user satisfaction and demonstrating comparative advantages over existing digital interventions. However, the study's limited sample size and short-term nature necessitate further research. Future studies should include larger-scale clinical trials, enhanced risk detection features, and integration with existing health care systems to fully realize its potential in supporting mental well-being across different linguistic and cultural contexts.
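The abstract reports F statistics for the comparisons but does not name the test. Assuming a one-way analysis of variance (an assumption, not something the abstract confirms), the F statistic is the ratio of between-group to within-group mean squares, as in the minimal sketch below. The group scores are hypothetical and chosen so the arithmetic is easy to verify by hand.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of score lists."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical satisfaction scores for three tools (not from the study).
sample_groups = [
    [8, 9, 10],
    [6, 7, 8],
    [4, 5, 6],
]

f_statistic = one_way_anova_f(sample_groups)
print(f_statistic)  # 12.0
```

Converting F to a P value additionally requires the F distribution with (k - 1, n - k) degrees of freedom, which is omitted here to keep the sketch dependency-free.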


Volume 13, e63538. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11748427/pdf/

Citations: 0

The Transformative Potential of Large Language Models in Mining Electronic Health Records Data: Content Analysis.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2025-01-02 DOI: 10.2196/58457

Amadeo Jesus Wals Zurita, Hector Miras Del Rio, Nerea Ugarte Ruiz de Aguirre, Cristina Nebrera Navarro, Maria Rubio Jimenez, David Muñoz Carmona, Carlos Miguez Sanchez

Background: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records.

Objective: We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.

Methods: We implemented a script using the OpenAI application programming interface to extract structured information in JavaScript Object Notation (JSON) format from comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by 5 specialists in radiation oncology. We compared the results using metrics such as sensitivity, specificity, precision, accuracy, F-value, κ index, and the McNemar test, in addition to examining the common causes of errors in both humans and generative pretrained transformer (GPT) models.
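The comparison metrics named in the Methods can be sketched in a few lines. The example below is a minimal illustration, not the authors' script: it assumes each candidate comorbidity is reduced to a binary label (1 = present) with a gold-standard `truth` list, and the function names `confusion`, `metrics`, and `mcnemar_exact` are hypothetical. The exact (binomial) form of the McNemar test is also an assumption, since the abstract does not say which variant was used.

```python
from math import comb


def confusion(truth, pred):
    """Count TP, FP, FN, TN for paired binary labels (1 = comorbidity present)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(truth, pred):
    """Sensitivity, specificity, precision, accuracy, F1 and Cohen's kappa."""
    tp, fp, fn, tn = confusion(truth, pred)
    n = tp + fp + fn + tn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    acc = (tp + tn) / n
    f1 = 2 * prec * sens / (prec + sens)
    # Kappa corrects the observed agreement (acc) for agreement expected by chance.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (acc - p_chance) / (1 - p_chance)
    return {"sensitivity": sens, "specificity": spec, "precision": prec,
            "accuracy": acc, "f1": f1, "kappa": kappa}


def mcnemar_exact(truth, pred_a, pred_b):
    """Two-sided exact McNemar test on items where only one rater is correct."""
    only_a = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Under H0 each discordant pair favours either rater with probability 0.5.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)


# Toy data, invented for illustration only.
truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
model = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
human = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(metrics(truth, model))
print(mcnemar_exact(truth, model, human))
```

The McNemar test only looks at discordant pairs, which is why it suits a paired design in which the model and the physicians rate the same reports.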

Results: The GPT-3.5 model exhibited slightly lower performance compared to physicians across all metrics, though the differences were not statistically significant (McNemar test, P=.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar test, P<.001). Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs 96.8%). GPT-4 showed greater consistency, replicating the exact same results in 76% of the reports across 10 repeated analyses, compared to 59% for GPT-3.5, indicating more stable and reliable performance. Physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred nonexplicit comorbidities, sometimes correctly, though this also resulted in more false positives.
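The consistency figure (identical results in 76% of reports across 10 repeated analyses) can be read as an exact-replication rate. The sketch below shows one way such a rate might be computed; the abstract does not define "exact same results", so treating it as "all repeated runs produced the same structured output" is an assumption, and `canonical` and `exact_replication_rate` are hypothetical names.

```python
import json


def canonical(output):
    """Serialize a parsed JSON output so key order does not affect equality."""
    return json.dumps(output, sort_keys=True)


def exact_replication_rate(runs_by_report):
    """Share of reports whose repeated runs all yield the same output.

    runs_by_report maps a report ID to the list of parsed outputs,
    one per repeated analysis of that report.
    """
    stable = sum(
        1 for runs in runs_by_report.values()
        if len({canonical(r) for r in runs}) == 1
    )
    return stable / len(runs_by_report)


# Toy example: r1 is stable (same content, different key order), r2 is not.
runs = {
    "r1": [{"diabetes": True, "copd": False}, {"copd": False, "diabetes": True}],
    "r2": [{"diabetes": True}, {"diabetes": False}],
}
print(exact_replication_rate(runs))
```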

Conclusions: This study demonstrates that, with well-designed prompts, the large language models examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior efficiency in time and costs, along with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation.



Citations: 0

Enhancing Clinical Decision Making by Predicting Readmission Risk in Patients With Heart Failure Using Machine Learning: Predictive Model Development Study.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2024-12-31 DOI: 10.2196/58812

Xiangkui Jiang, Bingquan Wang

Background: Patients with heart failure frequently face the possibility of rehospitalization following an initial hospital stay, placing a significant burden on both patients and health care systems. Accurate predictive tools are crucial for guiding clinical decision-making and optimizing patient care. However, the effectiveness of existing models tailored specifically to the Chinese population is still limited.

Objective: This study aimed to formulate a predictive model for assessing the likelihood of readmission among patients diagnosed with heart failure.

Methods: In this study, we analyzed data from 1948 patients with heart failure in a hospital in Sichuan Province between 2016 and 2019. Applying 3 variable selection strategies, we identified 29 relevant variables. Subsequently, we constructed 6 predictive models using different algorithms: logistic regression, support vector machine, gradient boosting machine, Extreme Gradient Boosting, multilayer perceptron, and graph convolutional networks.

Results: The graph convolutional network model showed the highest prediction accuracy with an area under the receiver operating characteristic curve of 0.831, accuracy of 75%, sensitivity of 52.12%, and specificity of 90.25%.
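The three reported rates are linked by a simple identity: accuracy is the prevalence-weighted average of sensitivity and specificity. The sketch below solves that identity for the readmission prevalence the figures would imply. This is a back-of-envelope consistency check, not a value reported in the abstract; it assumes all three metrics were computed on the same evaluation set at the same decision threshold, and that the rounded 75% accuracy is close to the true value. `implied_prevalence` is a hypothetical helper.

```python
def implied_prevalence(accuracy, sensitivity, specificity):
    """Solve accuracy = sens * p + spec * (1 - p) for the prevalence p."""
    return (specificity - accuracy) / (specificity - sensitivity)


# Figures quoted in the Results; the accuracy is rounded in the source.
p = implied_prevalence(accuracy=0.75, sensitivity=0.5212, specificity=0.9025)
print(round(p, 2))
```

Under these assumptions the model would be detecting roughly half of the readmissions while rarely flagging patients who are not readmitted, which is the trade-off the sensitivity and specificity figures describe.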

Conclusions: The model crafted in this study proves its effectiveness in forecasting the likelihood of readmission among patients with heart failure, thus serving as a crucial reference for clinical decision-making.



Citations: 0

Effectiveness of Outpatient Chronic Pain Management for Middle-Aged Patients by Internet Hospitals: Retrospective Cohort Study.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2024-12-30 DOI: 10.2196/54975

Ling Sang, Bixin Zheng, Xianzheng Zeng, Huizhen Liu, Qing Jiang, Maotong Liu, Chenyu Zhu, Maoying Wang, Zengwei Yi, Keyu Song, Li Song

Background: Chronic pain is widespread and carries a heavy disease burden, and there is a lack of effective outpatient pain management. As an emerging internet medical platform in China, internet hospitals have been successfully applied to the management of chronic diseases. Some patients with chronic pain also use internet hospitals for pain management. However, no studies have investigated the effectiveness of pain management via internet hospitals.

Objective: The aim of this retrospective cohort study was to explore the effectiveness of chronic pain management by internet hospitals and their advantages and disadvantages compared to traditional physical hospital visits.

Methods: This was a retrospective cohort study. Demographic information such as the patient's sex, age, and number of visits was obtained from the IT center. During the first and last patient visits, information on outcome variables such as the Brief Pain Inventory (BPI), medical satisfaction, medical costs, and adverse drug events was obtained through a telephone follow-up. All patients with chronic pain who had 3 or more visits (internet or offline) between September 2021 and February 2023 were included. The patients were divided into an internet hospital group and a physical hospital group, according to whether they had web-based or in-person consultations, respectively. To control for confounding variables, propensity score matching was used to match the two groups. Matching variables included age, sex, diagnosis, and number of clinic visits.
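The matching step described above can be illustrated with a small sketch. The abstract does not state the matching algorithm, ratio, or caliper, so the greedy 1:1 nearest-neighbor scheme and the 0.05 caliper below are assumptions. The propensity scores are taken as given; in practice they would typically come from a logistic regression of group membership on the matching variables (age, sex, diagnosis, number of visits). `greedy_match` is a hypothetical name.

```python
def greedy_match(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on precomputed propensity scores.

    treated, control: dicts mapping patient ID -> propensity score.
    Returns a list of (treated_id, control_id) pairs; each control is used
    at most once, and pairs further apart than the caliper are dropped.
    """
    available = dict(control)
    pairs = []
    # Fixed processing order keeps the result reproducible.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Toy scores, invented for illustration only.
internet = {"i1": 0.30, "i2": 0.62, "i3": 0.95}
physical = {"p1": 0.28, "p2": 0.33, "p3": 0.60, "p4": 0.70}
print(greedy_match(internet, physical))
```

Patients without a sufficiently close counterpart stay unmatched, which is how a cohort of 122 and 739 patients can shrink to equally sized matched groups.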

Results: A total of 122 people in the internet hospital group and 739 people in the physical hospital group met the inclusion criteria. After propensity score matching, 77 patients in each of the two groups were included in the analysis. There was no significant difference in the quality of life (QOL; QOL assessment was part of the BPI scale) between the internet hospital group and the physical hospital group (P=.80), but the QOL of both groups of patients improved after pain management (internet hospital group: P<.001; physical hospital group: P=.001). There were no significant differences in the pain relief rate (P=.25) or the incidence of adverse events (P=.60) between the two groups. The total cost (P<.001) and treatment-related cost (P<.001) of the physical hospital group were higher than those of the internet hospital group. In addition, the degree of satisfaction in the internet hospital group was greater than that in the physical hospital group (P=.01).

Conclusions: Internet hospitals are an effective way of managing chronic pain. They can improve patients' QOL and satisfaction, reduce treatment costs, and can be used as part of a multimodal strategy for chronic pain self-management.



Citations: 0

Evaluation of a Telemonitoring System Using Electronic National Early Warning Scores for Patients Receiving Medical Home Care: Pilot Implementation Study.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2024-12-26 DOI: 10.2196/63425

Cheng-Fu Lin, Pei-Jung Chang, Hui-Min Chang, Ching-Tsung Chen, Pi-Shan Hsu, Chieh-Liang Wu, Shih-Yi Lin

Background: Telehealth programs and wearable sensors that enable patients to monitor their vital signs have expanded due to the COVID-19 pandemic. The electronic National Early Warning Score (e-NEWS) system helps identify and respond to acute illness.

Objective: This study aimed to implement and evaluate a comprehensive telehealth system to monitor vital signs using e-NEWS for patients receiving integrated home-based medical care (iHBMC). The goal was to improve the early detection of patient deterioration and enhance care delivery in home settings. The system was deployed to optimize remote monitoring in iHBMC and reduce emergency visits and hospitalizations.

Methods: The study was conducted at a medical center and its affiliated home health agency in central Taiwan from November 1, 2022, to October 31, 2023. Patients eligible for iHBMC were enrolled, and sensor data from devices such as blood pressure monitors, thermometers, and pulse oximeters were transmitted to a cloud-based server for e-NEWS calculations at least twice per day over a 2-week period. Patients with e-NEWSs up to 4 received nursing or physician recommendations and interventions based on abnormal physiological data, with reassessment occurring after 2 hours.
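To show what a server-side early warning score calculation involves, here is a minimal sketch. It assumes the standard NEWS2 bands published by the Royal College of Physicians (SpO2 scale 1); the abstract does not specify which parameters or cut-offs the study's e-NEWS used, and the sensors it lists do not obviously cover respiratory rate or consciousness, so those inputs are assumptions as well. `band` and `news2` are hypothetical names, and the code illustrates the arithmetic only; it is not a validated clinical tool.

```python
INF = float("inf")

# Each table lists (upper bound, score) pairs in ascending order.
RESP = [(8, 3), (11, 1), (20, 0), (24, 2), (INF, 3)]                # breaths/min
SPO2 = [(91, 3), (93, 2), (95, 1), (INF, 0)]                        # percent
SBP = [(90, 3), (100, 2), (110, 1), (219, 0), (INF, 3)]             # mm Hg
PULSE = [(40, 3), (50, 1), (90, 0), (110, 1), (130, 2), (INF, 3)]   # beats/min
TEMP = [(35.0, 3), (36.0, 1), (38.0, 0), (39.0, 1), (INF, 2)]       # degrees C


def band(value, bands):
    """Return the score of the first band whose upper bound is >= value."""
    for upper, score in bands:
        if value <= upper:
            return score
    raise ValueError(f"cannot score value {value!r}")


def news2(resp_rate, spo2, on_oxygen, systolic_bp, pulse, alert, temp_c):
    """Aggregate score from seven inputs under the assumed NEWS2 bands."""
    return (
        band(resp_rate, RESP)
        + band(spo2, SPO2)
        + (2 if on_oxygen else 0)   # supplemental oxygen adds 2 points
        + band(systolic_bp, SBP)
        + band(pulse, PULSE)
        + (0 if alert else 3)       # any new confusion or reduced response scores 3
        + band(temp_c, TEMP)
    )


print(news2(16, 97, False, 120, 70, True, 36.8))
print(news2(22, 93, True, 95, 115, True, 38.5))
```

Keeping the bands in data tables rather than nested conditionals makes it easier to swap in a locally adapted scoring scheme if the deployed system differs from the assumed one.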

Results: A total of 28 participants were enrolled, with a median age of 84.5 (IQR 79.3-90.8) years, and 32% (n=9) were male. All participants had caregivers, with only 5 out of 28 (18%) able to make decisions independently. The system was implemented across one medical center and its affiliated home health agency. Of the 28 participants, 27 completed the study, while 1 exited early due to low blood pressure and shortness of breath. The median e-NEWS value was 4 (IQR 3-6), with 397 abnormal readings recorded. Of the remaining 27 participants, 8 had earlier home visits due to abnormal readings, 6 required hypertension medication adjustments, and 9 received advice on oxygen supplementation. Overall, 24 out of 28 (86%) participants reported being satisfied with the system.

Conclusions: This study demonstrated the feasibility of implementing a telehealth system integrated with e-NEWS in iHBMC settings, potentially aiding in the early detection of clinical deterioration. Although caregivers receive training and resources for their tasks, the system may increase their workload, which could lead to higher stress levels. The small sample size, short monitoring duration, and regional focus on central Taiwan may further limit the applicability of the findings to other countries, regions, and health care infrastructures. Further research is required to confirm its impact.



Citations: 0

Daily automated prediction of delirium risk in hospitalized patients: Model development and validation.

IF 3.1 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS

JMIR Medical Informatics

Pub Date : 2024-12-25 DOI: 10.2196/60442

Kendrick Matthew Shaw, Yu-Ping Shao, Manohar Ghanta, Valdery Moura Junior, Eyal Y Kimchi, Timothy T Houle, Oluwaseun Akeju, Michael Brandon Westover

Background: Delirium is common in hospitalized patients and correlated with increased morbidity and mortality. Despite this, delirium is underdiagnosed, and many institutions do not have sufficient resources to consistently apply effective screening and prevention.

Objective: To develop a machine learning algorithm to identify patients at highest risk of delirium in the hospital each day in an automated fashion based on data available in the electronic medical record, reducing the barrier to large-scale delirium screening.

Methods: We developed and compared multiple machine learning models on a retrospective dataset of all hospitalized adult patients with recorded Confusion Assessment Method (CAM) screens at a major academic medical center from April 2, 2016, to January 16, 2019, comprising 23006 patients. The patient's age, gender, and all available laboratory values, vital signs, prior CAM screens, and medication administrations were used as potential predictors. Four machine learning approaches were investigated: logistic regression with L1-regularization, multilayer perceptrons, random forests, and boosted trees. Model development used 80% of the patients; the remaining 20% were reserved for testing the final models. Lab values, vital signs, medications, gender, and age were used to predict a positive CAM screen in the next 24 hours.
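Because the 80/20 division is stated in terms of patients rather than individual screens, the split has to be made on patient IDs so that no patient contributes rows to both sets. The sketch below shows one way to do that; the random shuffle, the seed, and the name `split_patients` are assumptions, as the abstract does not describe how patients were assigned.

```python
import random


def split_patients(patient_ids, test_fraction=0.2, seed=0):
    """Split unique patient IDs into disjoint development and test sets.

    patient_ids may contain repeats (one entry per screen or per day);
    every record of a given patient ends up on the same side of the split.
    """
    unique_ids = sorted(set(patient_ids))   # sort first so the seed is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_fraction)
    test_ids = set(unique_ids[:n_test])
    dev_ids = set(unique_ids[n_test:])
    return dev_ids, test_ids


# Toy example: 100 patients with 3 records each.
records = [f"p{i}" for i in range(100)] * 3
dev_ids, test_ids = split_patients(records)
print(len(dev_ids), len(test_ids))
```

Splitting at the record level instead would let repeated measurements from the same admission leak between sets and inflate test performance.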

Results: The boosted tree model achieved the greatest predictive power, with a 0.92 area under the receiver operating characteristic curve (AUROC) (95% confidence interval [CI] 0.913-0.922), followed by the random forest at 0.91 (95% CI 0.909-0.918), multilayer perceptron at 0.86 (95% CI 0.850-0.861), and logistic regression at 0.85 (95% CI 0.841-0.852). These AUROCs decreased to 0.78-0.82 and 0.74-0.80 when limited to patients not currently or never delirious, respectively.
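For readers unfamiliar with how an AUROC and its confidence interval are obtained, the following sketch computes the AUROC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one, and derives an interval by percentile bootstrap. The abstract does not state how its intervals were calculated, so the bootstrap is an assumption, and `auroc` and `bootstrap_ci` are hypothetical names. The pairwise loop is quadratic and meant for small illustrations, not for a cohort of this size.

```python
import random


def auroc(labels, scores):
    """Probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:   # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data, invented for illustration only.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
print(auroc(labels, scores))
print(bootstrap_ci(labels, scores))
```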

Conclusions: A boosted tree machine learning model was able to identify hospitalized patients at elevated risk for delirium in the next 24 hours. This may allow for automated delirium risk screening and more precise targeting of proven and investigational interventions to prevent delirium.




Citations: 0