A Pragmatic Framework for Federated Learning Risk and Governance in Academic Medical Centers
Daniel Bottomly, Bridget Barnes, Kuli Mavuwa, Nikki Lee, Holger R Roth, Chester Chen, Shannon K McWeeney
With the rapid development of artificial intelligence (AI), particularly large language models, there is growing interest in adopting AI approaches within academic medical centers (AMCs). However, the vast amounts of data required for AI and the sensitive nature of medical information pose significant challenges to developing high-performing models at individual institutions. Furthermore, recent changes in government funding priorities may result in the decentralization of biomedical data repositories, risking significant barriers to effective data sharing and robust model development. This has generated significant interest in federated learning (FL), which enables collaborative model training without transferring data between institutions, thereby enhancing the protection of proprietary and sensitive information. While FL offers a crucial pathway to multi-institutional AI development that maintains data privacy, it also exposes AMCs to novel governance, security, and operational risks that are not fully addressed by existing procedures. In response, this manuscript provides a perspective grounded both in leading international standards (the National Institute of Standards and Technology Artificial Intelligence Risk Management Framework [NIST AI RMF] and International Organization for Standardization/International Electrotechnical Commission [ISO/IEC] 42001) and in the real-world governance experience of AMC leadership. We present a risk differentiation framework, an FL risk matrix, and a set of essential governance artifacts, each mapped to key institutional challenges and reviewed for alignment with core standards, but offered as pragmatic, illustrative guides rather than prescriptive checklists. Together, these tools offer AMC security, privacy, and governance leaders a standards-informed, context-sensitive resource for addressing the evolving risks of FL in biomedical research and clinical environments.
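To make FL's data-stays-local mechanism concrete, the sketch below shows a minimal federated averaging (FedAvg) loop. The three-site setup, linear model, and round count are hypothetical illustrations rather than anything prescribed by the article; production deployments add the secure aggregation, authentication, and auditing controls that the governance artifacts discussed here are meant to cover.

```python
# Minimal FedAvg sketch: each site trains locally and shares only model
# weights with a coordinator; raw records never leave the institution.
# The sites, data, and linear model below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's local training step: gradient descent on squared error."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three hypothetical AMCs, each holding its own private (X, y).
sites = [(rng.normal(size=(100, 3)), rng.normal(size=100)) for _ in range(3)]

global_w = np.zeros(3)
for _ in range(10):  # federation rounds
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    sizes = [len(y) for _, y in sites]
    # The coordinator aggregates weights (weighted by site sample count);
    # it only ever sees parameters, never patient-level data.
    global_w = np.average(local_ws, axis=0, weights=sizes)
```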
{"title":"A Pragmatic Framework for Federated Learning Risk and Governance in Academic Medical Centers.","authors":"Daniel Bottomly, Bridget Barnes, Kuli Mavuwa, Nikki Lee, Holger R Roth, Chester Chen, Shannon K McWeeney","doi":"10.2196/80022","DOIUrl":"10.2196/80022","url":null,"abstract":"<p><strong>Unlabelled: </strong>With the rapid development of artificial intelligence (AI), particularly large language models, there is growing interest in adopting AI approaches within academic medical centers (AMCs). However, the vast amounts of data required for AI and the sensitive nature of medical information pose significant challenges to developing high-performing models at individual institutions. Furthermore, recent changes in government funding priorities may result in the decentralization of biomedical data repositories that risk creating significant barriers to effective data sharing and robust model development. This has generated significant interest in federated learning (FL), which enables collaborative model training without transferring data between institutions, thereby enhancing the protection of proprietary and sensitive information. While FL offers a crucial pathway to enable multi-institutional AI development while maintaining data privacy, it also exposes AMCs to novel governance, security, and operational risks that are not fully addressed by existing procedures. In response, this manuscript provides a perspective grounded in both leading international standards (NIST AI RMF [National Institute of Standards and Technology Artificial Intelligence Risk Management Framework], International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) 42001) and in the real-world governance experience of AMC leadership. We present a risk differentiation framework, an FL risk matrix, and a set of essential governance artifacts-each mapped to key institutional challenges and reviewed for alignment with core standards but offered as pragmatic, illustrative guides rather than prescriptive checklists. Together, these tools represent a novel resource to support AMC security, privacy, and governance leaders with standards-informed, context-sensitive tools for addressing the evolving risks of FL in biomedical research and clinical environments.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e80022"},"PeriodicalIF":2.0,"publicationDate":"2026-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12977002/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147437902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: In recent years, artificial intelligence (AI) systems have increasingly been used to assess emotional states in health care. AI offers a safe, quick, user-friendly, and objective emotional evaluation method. However, evidence supporting its implementation in health care remains limited.
Objective: This study aimed to explore the concurrent validity and test-retest reliability of emotion recognition AI based on facial expressions.
Methods: In this study, we used the Kokoro Sensor, an accurate and widely recognized automated facial expression recognition system. The Japanese version of the Profile of Mood States-Short Form was used to screen for the potential influence of mental states on facial expressions. The study participants made positive, negative, and neutral expressions, which were analyzed by the emotion recognition AI. Agreement between the AI results and subjective evaluations was assessed by participants and a researcher using a 4-point Likert-type scale. The facial expressions and emotion analysis process were repeated after a 30-minute interval to investigate reliability. Concurrent validity was evaluated using the content validity index (CVI) and κ coefficient, and test-retest reliability was determined using the κ coefficient.
Results: The study participants were 40 individuals whose mental states did not deviate from the reference range of the Profile of Mood States manual. Among the participants, the CVI values for positive, neutral, and negative expressions were 95%, 98%, and 85%, respectively. Among the researchers, the corresponding CVI values were 100%, 100%, and 70%, respectively. The overall weighted κ coefficient was 0.55 (CI 0.44-0.67), indicating moderate agreement. The agreement was almost perfect for distinguishing positive from neutral expressions (κ=0.83, 95% CI 0.70-0.95) but not statistically significant for distinguishing negative from neutral expressions (κ=0.15, 95% CI -0.07 to 0.37). Test-retest reliability analysis showed an overall weighted κ coefficient of 0.66, reflecting substantial reliability. Almost perfect agreement was observed for distinguishing positive from neutral expressions (κ=0.85, 95% CI 0.73-0.97), while distinguishing negative from neutral expressions showed limited reliability (κ=0.36, 95% CI 0.16-0.57).
Conclusions: Our findings suggest that the Kokoro Sensor may be useful for identifying positive affect, given its acceptable concurrent validity for overall valence estimation and its high agreement for distinguishing positive from neutral expressions. However, concurrent validity for negative expressions did not meet the prespecified benchmark based on the researcher's ratings, and agreement for distinguishing negative from neutral expressions was limited, which may constrain clinical utility for detecting negative affect. Therefore, in clinical settings, the Kokoro Sensor should be used as an auxiliary tool rather than as a standalone method.
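As a rough illustration of the agreement statistics above, the following sketch computes a weighted Cohen κ on ordinally coded valence labels, plus pairwise κ values for the positive-neutral and negative-neutral contrasts. The example labels and the linear weighting scheme are assumptions for illustration; the study's data and exact κ configuration are not reproduced here.

```python
# Agreement-analysis sketch: weighted κ on ordinal valence codes, plus
# pairwise κ for two-class contrasts. The label arrays are invented.
from sklearn.metrics import cohen_kappa_score

# Ordinal coding: -1 = negative, 0 = neutral, 1 = positive.
ai_labels    = [1, 0, -1, 1, 0, 0, 1, -1]
rater_labels = [1, 0,  0, 1, 0, 1, 1, -1]

# Linear weighting penalizes positive/negative confusions more than
# positive/neutral ones (the weighting scheme is assumed here).
overall_kappa = cohen_kappa_score(ai_labels, rater_labels, weights="linear")

def pairwise_kappa(a, b, keep):
    """κ restricted to cases where both raters used one of two classes."""
    pairs = [(x, y) for x, y in zip(a, b) if x in keep and y in keep]
    xs, ys = zip(*pairs)
    return cohen_kappa_score(xs, ys)

pos_vs_neutral = pairwise_kappa(ai_labels, rater_labels, {1, 0})
neg_vs_neutral = pairwise_kappa(ai_labels, rater_labels, {-1, 0})
print(overall_kappa, pos_vs_neutral, neg_vs_neutral)
```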
{"title":"Facial Expression-Based Evaluation of the Emotion Estimation Software Kokoro Sensor in Healthy Individuals: Validation and Reliability Pilot Study.","authors":"Shota Yoshihara, Satoru Amano, Kayoko Takahashi","doi":"10.2196/81868","DOIUrl":"10.2196/81868","url":null,"abstract":"<p><strong>Background: </strong>In recent years, artificial intelligence (AI) systems have increasingly been used to assess emotional states in health care. AI offers a safe, quick, user-friendly, and objective emotional evaluation method. However, evidence supporting its implementation in health care remains limited.</p><p><strong>Objective: </strong>This study aimed to explore the concurrent validity and test-retest reliability of emotion recognition AI based on facial expressions.</p><p><strong>Methods: </strong>In this study, we used the Kokoro Sensor, an accurate and widely recognized automated facial expression recognition system. The Japanese version of the Profile of Mood States-Short Form was used to screen the potential influence of mental states on facial expressions. The study participants made positive, negative, and neutral expressions, which were analyzed by the emotion recognition AI. Agreement between the results of the AI and subjective evaluations was assessed by participants and a researcher using a 4-point Likert-type scale. The facial expressions and emotion analysis process were repeated after a 30-minute interval to investigate reliability. Concurrent validity was evaluated using the content validity index (CVI) and κ coefficient, and test-retest reliability was determined using the κ coefficient.</p><p><strong>Results: </strong>The study participants were 40 individuals whose mental states did not deviate from the reference range of the Profile of Mood States manual. Among the participants, the CVI values for positive, neutral, and negative expressions were 95%, 98%, and 85%, respectively. Among the researchers, the corresponding CVI values were 100%, 100%, and 70%, respectively. The overall weighted κ coefficient was 0.55 (CI 0.44-0.67), indicating moderate agreement. The agreement was almost perfect for distinguishing positive from neutral expressions (κ=0.83, 95% CI 0.70-0.95) but not statistically significant for distinguishing negative from neutral expressions (κ=0.15, 95% CI -0.07 to 0.37). Test-retest reliability analysis showed an overall weighted κ coefficient of 0.66, reflecting substantial reliability. Almost perfect agreement was observed for distinguishing positive from neutral expressions (κ=0.85, 95% CI 0.73-0.97), while distinguishing negative from neutral expressions showed limited reliability (κ=0.36, 95% CI 0.16-0.57).</p><p><strong>Conclusions: </strong>Our findings suggest that the Kokoro Sensor may be useful for identifying positive affect, given its acceptable concurrent validity for overall valence estimation and its high agreement for distinguishing positive from neutral expressions. However, concurrent validity for negative expressions did not meet the prespecified benchmark based on the researcher's ratings, and agreement for distinguishing negative from neutral expressions was limited, which may constrain clinical utility for detecting negative affect. 
Therefore, in clinical settings, the Kokor","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81868"},"PeriodicalIF":2.0,"publicationDate":"2026-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12945095/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147313280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompal Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib
Background: The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHRs), but this process is labor-intensive and prone to interrater variability. Large language models (LLMs) have demonstrated potential in automating text classification.
Objective: We aimed to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.
Methods: We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) at hospital discharge and (2) approximately 90 days post discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS scores 0-2) versus non-independence (mRS scores 3-6). Four-fold cross-validation was conducted using accuracy and the Cohen κ as model performance metrics.
Results: A total of 2290 EHR passages with corresponding mRS scores were included in model training. The multiclass model, considering all seven mRS scores, attained an accuracy of 77% and a weighted Cohen κ of 0.92. Class-specific accuracy was the highest for mRS score 4 (90%) and the lowest for mRS score 2 (28%). The binary model, considering only functional independence versus non-independence, attained an accuracy of 92% and a Cohen κ of 0.84.
Conclusions: Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis; however, improving discrimination between intermediate scores is required.
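To illustrate the two evaluation setups, this sketch scores hypothetical predictions with exact-match accuracy and weighted Cohen κ for the 7-class task, then collapses scores into the independence split (mRS 0-2 vs 3-6) for the binary task. The predictions and the quadratic weighting are assumptions; the abstract reports a weighted κ without specifying the weighting scheme.

```python
# Evaluation sketch for the two mRS classification setups; y_pred stands in
# for a fine-tuned LLM's outputs on held-out EHR passages.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 1, 2, 3, 4, 5, 6, 4, 2, 3]
y_pred = [0, 1, 1, 3, 4, 5, 6, 4, 3, 3]  # hypothetical predictions

# Multiclass: a weighted κ credits near-misses (predicting 3 for a true 4),
# which is how κ can be high (0.92) while exact accuracy is modest (77%).
multiclass_acc = accuracy_score(y_true, y_pred)
weighted_kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")

# Binary: functional independence (mRS 0-2) vs non-independence (3-6).
to_binary = lambda s: int(s >= 3)
binary_acc = accuracy_score([to_binary(s) for s in y_true],
                            [to_binary(s) for s in y_pred])
binary_kappa = cohen_kappa_score([to_binary(s) for s in y_true],
                                 [to_binary(s) for s in y_pred])
print(multiclass_acc, weighted_kappa, binary_acc, binary_kappa)
```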
{"title":"Assessment of the Modified Rankin Scale in Electronic Health Records With a Fine-Tuned Large Language Model: Development and Internal Validation.","authors":"Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompal Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib","doi":"10.2196/82607","DOIUrl":"10.2196/82607","url":null,"abstract":"<p><strong>Background: </strong>The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHRs), but this process is labor-intensive and prone to interrater variability. Large language models (LLMs) have demonstrated potential in automating text classification.</p><p><strong>Objective: </strong>We aimed to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.</p><p><strong>Methods: </strong>We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) at hospital discharge and (2) approximately 90 days post discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS scores 0-2) versus non-independence (mRS scores 3-6). Four-fold cross-validation was conducted using accuracy and the Cohen κ as model performance metrics.</p><p><strong>Results: </strong>A total of 2290 EHR passages with corresponding mRS scores were included in model training. The multiclass model-considering all seven scores of the mRS-attained an accuracy of 77% and a weighted Cohen κ of 0.92. Class-specific accuracy was the highest for mRS score 4 (90%) and the lowest for mRS score 2 (28%). The binary model-considering only functional independence versus non-independence-attained an accuracy of 92% and a Cohen κ of 0.84.</p><p><strong>Conclusions: </strong>Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis; however, improving discrimination between intermediate scores is required.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e82607"},"PeriodicalIF":2.0,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12935414/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147292138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Petra Apell, Sara Locher, Annie Milde, Henrik Eriksson
Background: Artificial intelligence (AI) is a topic of considerable hype, with many actors sensing its high potential for health care applications. Despite this, adoption has been slow, and few applications have been implemented in clinical practice.
Objective: The aim of our study was to investigate the challenges associated with using AI in health care, as well as provide suggestions for how further adoption of AI within health care organizations can be facilitated.
Methods: A qualitative case study with a mixed methods approach was conducted at one of Sweden's largest hospitals. Regulatory-approved AI medical devices were analyzed, and primary qualitative data from 14 expert interviews were collected and cross-referenced with secondary quantitative data. The framework of technological innovation systems was used to analyze the system factors and their dynamics to identify blocking mechanisms and areas for improvement.
Results: Addressing the challenges related to knowledge development, diffusion, legitimation, and resource mobilization could trigger a cascade of positive activities, thereby significantly enhancing the overall performance of the innovation system. Creating dedicated testing environments to evaluate safety and efficacy would facilitate routine clinical use and reinforce the use of AI innovations in health care organizations.
Conclusions: This analysis shows that the adoption of AI health care technology innovations can be accelerated through targeted strategies and supportive mechanisms triggering virtuous cycles that facilitate clinical validation and generate compelling use cases. The interconnection between guidance of search and entrepreneurial experimentation has been confirmed, providing the initial conditions for knowledge development, diffusion, and legitimation in the early stages of emerging technologies.
{"title":"Explaining the Slow Adoption of AI Innovations in Health Care: Network Analysis Approach.","authors":"Petra Apell, Sara Locher, Annie Milde, Henrik Eriksson","doi":"10.2196/60458","DOIUrl":"10.2196/60458","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) is a topic of considerable hype, with many actors sensing its high potential for health care applications. Despite this, the adoption has been slow, with few applications being implemented in clinical practice.</p><p><strong>Objective: </strong>The aim of our study was to investigate the challenges associated with using AI in health care, as well as provide suggestions for how further adoption of AI within health care organizations can be facilitated.</p><p><strong>Methods: </strong>A qualitative case study with a mixed methods approach was conducted at one of Sweden's largest hospitals. Regulatory approved AI medical devices were analyzed, and primary qualitative data from 14 expert interviews were collected and cross-referenced with secondary quantitative data. The framework of technological innovation systems was used to analyze the system factors and their dynamics to identify blocking mechanisms and areas for improvement.</p><p><strong>Results: </strong>The challenges related to knowledge development, diffusion, legitimation, and resource mobilization could trigger a cascade of positive activities, thereby significantly enhancing the overall performance of the innovation system. Creating dedicated testing environments to evaluate safety and efficacy would facilitate the routine clinical use and reinforce the use of AI innovations in health care organizations.</p><p><strong>Conclusions: </strong>This analysis shows that the adoption of AI health care technology innovations can be accelerated through targeted strategies and supportive mechanisms triggering virtuous cycles that facilitate clinical validation and generate compelling use cases. The interconnection between guidance of search and entrepreneurial experimentation has been confirmed, providing the initial conditions for knowledge development, diffusion, and legitimation in the early stages of emerging technologies.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e60458"},"PeriodicalIF":2.0,"publicationDate":"2026-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12972688/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147277910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Large language models (LLMs) are increasingly integrated into health care, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability.
Objective: This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms.
Methods: We systematically evaluate 3 LLMs on 3 health-related tasks using a novel dataset containing 3 types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels.
Results: Contrary to expectations, LLMs demonstrated notable robustness to common variations; in more than half of the cases (151/270, 55.92%), performance was stable or improved. In some cases (38/270, 14.07%), variations resulted in increased performance, especially at lower perturbation levels. Redactions, often stemming from privacy concerns or cognitive lapses, were more detrimental than other variations.
Conclusions: Our findings highlight the need for health care applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this study provides actionable insights for improving model resilience and guiding the development of safer, more effective artificial intelligence tools in health care. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions.
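As a sketch of how such perturbed inputs can be generated, the functions below apply the study's three human-like variation types (redaction, homophone substitution, and typographical errors) at a configurable perturbation rate. The homophone table, token-level redaction marker, and 20% rate are illustrative assumptions, not the published dataset's procedure.

```python
# Sketch of the three perturbation types at a configurable level. The
# homophone table and rates are illustrative, not the paper's dataset.
import random

HOMOPHONES = {"their": "there", "too": "to", "weak": "week", "heel": "heal"}

def redact(tokens, rate, rng):
    """Drop information outright, as with privacy redactions or lapses."""
    return [("[REDACTED]" if rng.random() < rate else t) for t in tokens]

def homophone_swap(tokens, rate, rng):
    """Replace a word with a same-sounding one when the table has a match."""
    return [(HOMOPHONES.get(t, t) if rng.random() < rate else t) for t in tokens]

def typo(tokens, rate, rng):
    """Swap adjacent characters to mimic typographical errors."""
    out = []
    for t in tokens:
        if len(t) > 3 and rng.random() < rate:
            i = rng.randrange(len(t) - 1)
            t = t[:i] + t[i + 1] + t[i] + t[i + 2:]
        out.append(t)
    return out

rng = random.Random(42)
text = "the patient felt too weak to walk on their heel".split()
for perturb in (redact, homophone_swap, typo):
    print(perturb.__name__, " ".join(perturb(text, rate=0.2, rng=rng)))
```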
{"title":"Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation.","authors":"Saubhagya Joshi, Monjil Mehta, Sarjak Maniar, Mengqian Wang, Vivek Kumar Singh","doi":"10.2196/83640","DOIUrl":"10.2196/83640","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) are increasingly integrated into health care, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability.</p><p><strong>Objective: </strong>This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms.</p><p><strong>Methods: </strong>We systematically evaluate 3 LLMs on 3 health-related tasks using a novel dataset containing 3 types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels.</p><p><strong>Results: </strong>Contrary to expectations, LLMs demonstrate notable robustness to common variations, and in more than half of the cases (151/270, 55.92%), the performance was stable or improved. In some cases (38/270, 14.07%), variations resulted in an increased performance, especially when dealing with lower perturbation levels. Redactions, often stemming from privacy concerns or cognitive lapses, are more detrimental than other variations.</p><p><strong>Conclusions: </strong>Our findings highlight the need for health care applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this study provides actionable insights for improving model resilience and guiding the development of safer, more effective artificial intelligence tools in health care. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e83640"},"PeriodicalIF":2.0,"publicationDate":"2026-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12923095/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146260202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Images created with generative artificial intelligence (AI) tools are increasingly used for health communication due to their ease of use, speed, accessibility, and low cost. However, AI-generated images may bring practical and ethical risks to health practitioners and the public, including through the perpetuation of stigma against vulnerable and historically marginalized groups.
Objective: To understand the potential value of AI-generated images for health care and public health communication, we sought to analyze images of substance use disorder and recovery generated with ChatGPT. Specifically, we sought to investigate: (1) the default visual outputs produced in response to a range of prompts about substance use disorder and recovery, and (2) the extent to which prompt modification and guideline-informed prompting could mitigate potentially stigmatizing imagery.
Methods: We performed a mixed methods case study examining depictions of substance use and recovery in images generated by ChatGPT (GPT-4o). We generated images (n=84) using (1) prompts with colloquial and stigmatizing language, (2) prompts that follow best practices for person-first language, (3) image prompts written by ChatGPT, and (4) a custom GPT informed by guidelines for images of substance use disorder (SUD). We then used a mixed methods approach to analyze images for demographics and stigmatizing elements.
Results: Images produced by the default ChatGPT model featured primarily White men (81%, n=34). Further, images tended to be stigmatizing, featuring injection drug use, dark colors, and symbolic elements such as chains. These trends persisted even when person-first language prompts were used. Images produced by the guideline-informed custom GPT were markedly less stigmatizing; however, they featured almost exclusively Black women (74%, n=31).
Conclusions: Our findings confirm prior research about stigma and biases in AI-generated images and extend this literature to substance use. However, our findings also suggest that (1) images can be improved when clear guidelines are provided and (2) even with guidelines, iteration is needed to create an image that fully conforms to best practices.
{"title":"AI-Generated Images of Substance Use and Recovery: Mixed Methods Case Study.","authors":"Kathryn Heley, Jeffrey K Hom, Linnea Laestadius","doi":"10.2196/81977","DOIUrl":"10.2196/81977","url":null,"abstract":"<p><strong>Background: </strong>Images created with generative artificial intelligence (AI) tools are increasingly used for health communication due to their ease of use, speed, accessibility, and low cost. However, AI-generated images may bring practical and ethical risks to health practitioners and the public, including through the perpetuation of stigma against vulnerable and historically marginalized groups.</p><p><strong>Objective: </strong>To understand the potential value of AI-generated images for health care and public health communication, we sought to analyze images of substance use disorder and recovery generated with ChatGPT. Specifically, we sought to investigate: (1) the default visual outputs produced in response to a range of prompts about substance use disorder and recovery, and (2) the extent to which prompt modification and guideline-informed prompting could mitigate potentially stigmatizing imagery.</p><p><strong>Methods: </strong>We performed a mixed-methods case study examining depictions of substance use and recovery in images generated by ChatGPT 4.o. We generated images (n=84) using (1) prompts with colloquial and stigmatizing language, (2) prompts that follow best practices for person-first language, (3) image prompts written by ChatGPT, and (4) a custom GPT informed by guidelines for images of SUD. We then used a mixed-methods approach to analyze images for demographics and stigmatizing elements.</p><p><strong>Results: </strong>Images produced in the default ChatGPT model featured primarily White men (81%, n=34). Further, images tended to be stigmatizing, featuring injection drug use, dark colors, and symbolic elements such as chains. These trends persisted even when person-first language prompts were used. Images produced by the guideline-informed custom GPT were markedly less stigmatizing; however, they featured almost only Black women (74%, n=31).</p><p><strong>Conclusions: </strong>Our findings confirm prior research about stigma and biases in AI-generated images and extend this literature to substance use. However, our findings also suggest that (1) images can be improved when clear guidelines are provided and (2) even with guidelines, iteration is needed to create an image that fully concords with best practices.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81977"},"PeriodicalIF":2.0,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12919905/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146229993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kjersti Mevik, Ashenafi Zebene Woldaregay, Eva Lindell Jonsson, Miguel Tejedor, Claire Temple-Oberle
Background: The impact of surgical complications is substantial and multifaceted, affecting patients and their families, surgeons, and health care systems. Despite the remarkable progress in artificial intelligence (AI), there remains a notable gap in the prospective implementation of AI models in surgery that use real-time data to support decision-making and enable proactive intervention to reduce the risk of surgical complications.
Objective: This scoping review aims to assess and analyze the adoption and use of AI models for preventing surgical complications. Furthermore, this review aims to identify barriers and facilitators for implementation at the bedside.
Methods: Following PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, we conducted a literature search using IEEE Xplore, Scopus, Web of Science, MEDLINE, ProQuest, PubMed, ABI, Embase, Epistemonikos, CINAHL, and Cochrane registries. The inclusion criteria included empirical, peer-reviewed studies published in English between January 2013 and January 2025, involving AI models for preventing surgical complications (surgical site infections, heart and lung complications, or stroke) in real-world settings. Exclusions included retrospective algorithm-only validations, nonempirical research (eg, editorials or protocols), and non-English studies. Study characteristics and AI model development details were extracted, along with performance statistics (eg, sensitivity and area under the receiver operating characteristic curve). We then used thematic analysis to synthesize findings related to AI models, prediction outputs, and validation methods. Studies were grouped into three main themes: (1) duration of hypotension, (2) risk for complications, and (3) decision support tool.
Results: Of the 275 identified records, 19 were included. The included models frequently demonstrated strong technical accuracy, with high sensitivity and area under the receiver operating characteristic curve, particularly among studies evaluating decision support tools. However, only a few models were adopted routinely in clinical practice. Two studies evaluated clinicians' perceptions regarding the use of AI models, reporting predominantly positive assessments of their usefulness.
Conclusions: Overall, AI models hold potential to predict and prevent surgical complications, as the validation studies demonstrated high accuracy. However, implementation in routine practice remains limited by usability barriers, workflow misalignment, trust concerns, and financial and ethical constraints. The evidence included in this scoping review was limited by the heterogeneity in study design and the predominance of small-scale feasibility studies, particularly for hypotension prediction. Future research should prioritize prospectively validated models that use additional physiological features and address clinicians' concerns regarding generalization and adoption.
{"title":"Application of AI Models for Preventing Surgical Complications: Scoping Review of Clinical Readiness and Barriers to Implementation.","authors":"Kjersti Mevik, Ashenafi Zebene Woldaregay, Eva Lindell Jonsson, Miguel Tejedor, Claire Temple-Oberle","doi":"10.2196/75064","DOIUrl":"10.2196/75064","url":null,"abstract":"<p><strong>Background: </strong>The impact of surgical complications is substantial and multifaceted, affecting patients and their families, surgeons, and health care systems. Despite the remarkable progress in artificial intelligence (AI), there remains a notable gap in the prospective implementation of AI models in surgery that use real-time data to support decision-making and enable proactive intervention to reduce the risk of surgical complications.</p><p><strong>Objective: </strong>This scoping review aims to assess and analyze the adoption and use of AI models for preventing surgical complications. Furthermore, this review aims to identify barriers and facilitators for implementation at the bedside.</p><p><strong>Methods: </strong>Following PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, we conducted a literature search using IEEE Xplore, Scopus, Web of Science, MEDLINE, ProQuest, PubMed, ABI, Embase, Epistemonikos, CINAHL, and Cochrane registries. The inclusion criteria included empirical, peer-reviewed studies published in English between January 2013 and January 2025, involving AI models for preventing surgical complications (surgical site infections, and heart and lung complications or stroke) in real-world settings. Exclusions included retrospective algorithm-only validations, nonempirical research (eg, editorials or protocols), and non-English studies. Study characteristics and AI model development details were extracted, along with performance statistics (eg, sensitivity and area under the receiver operating characteristic curve). We then used thematic analysis to synthesize findings related to AI models, prediction outputs, and validation methods. Studies were grouped into three main themes: (1) duration of hypotension, (2) risk for complications, and (3) decision support tool.</p><p><strong>Results: </strong>Of the 275 identified records, 19 were included. The included models frequently demonstrated strong technical accuracy with high sensitivity and area under the receiver operating characteristic curve, particularly among studies evaluating decision support tools. However, only a few models were adopted routinely in clinical practice. Two studies evaluated the clinicians' perceptions regarding the use of AI models, reporting predominantly positive assessments of their usefulness.</p><p><strong>Conclusions: </strong>Overall, AI models hold potential to predict and prevent surgical complications as the validation studies demonstrated high accuracy. However, implementation in routine practice remains limited by usability barriers, workflow misalignment, trust concerns, and financial and ethical constraints. The evidence included in this scoping review was limited by the heterogeneity in study design and the predominance of small-scale feasibility studies, particularly for hypotension prediction. 
Future research should prioritize prospectively validated models ","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e75064"},"PeriodicalIF":2.0,"publicationDate":"2026-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12912657/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146215163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Large language models (LLMs) have fundamentally transformed approaches to natural language processing tasks across diverse domains. In health care, accurate and cost-efficient text classification is crucial, whether for clinical note analysis, diagnosis coding, or other related tasks, and LLMs show considerable promise. Text classification has long faced multiple challenges, including the need for manual annotation during training, the handling of imbalanced data, and the development of scalable approaches. In health care, additional challenges arise, particularly the critical need to preserve patient data privacy and the complexity of medical terminology. Numerous studies have leveraged LLMs for automated health care text classification and compared their performance with traditional machine learning-based methods, which typically require embedding, annotation, and training. However, existing systematic reviews of LLMs either do not specialize in text classification or do not focus specifically on the health care domain.
Objective: This research synthesizes and critically evaluates the current evidence in the literature on the use of LLMs for text classification in health care settings.
Methods: Major databases (eg, Google Scholar, Scopus, PubMed, ScienceDirect) and other resources were queried for papers published between 2018 and 2024, following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, resulting in 65 eligible research articles. These studies were categorized by text classification type (eg, binary classification, multilabel classification), application (eg, clinical decision support, public health and opinion analysis), methodology, type of health care text, and the metrics used for evaluation and validation.
Results: The systematic review includes 65 research articles published between 2020 and Q3 2024, showing a significant increase in publications over time, with 28 papers published in Q1-Q3 2024 alone. Fine-tuning was the most common LLM-based approach (35 papers), followed by prompt engineering (17 papers). BERT (Bidirectional Encoder Representations from Transformers) variants were predominantly used for multilabel classification (50%), whereas closed-source LLMs were most commonly applied to binary (44.0%) and multiclass (30.6%) classification tasks. Clinical decision support was the most frequent application (29 papers). Over 80% of studies used English-language datasets, with clinical notes being the most common text type. All studies employed accuracy-related metrics for evaluation, and the findings consistently showed that LLMs outperformed traditional machine learning approaches in health care text classification tasks.
Conclusions: This review identifies existing gaps in the literature and highlights future research directions for further investigation.
{"title":"Large Language Models for Health Care Text Classification: Systematic Review.","authors":"Hajar Sakai, Sarah S Lam","doi":"10.2196/79202","DOIUrl":"10.2196/79202","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have fundamentally transformed approaches to natural language processing tasks across diverse domains. In health care, accurate and cost-efficient text classification is crucial-whether for clinical note analysis, diagnosis coding, or other related tasks-and LLMs present promising potential. Text classification has long faced multiple challenges, including the need for manual annotation during training, the handling of imbalanced data, and the development of scalable approaches. In health care, additional challenges arise, particularly the critical need to preserve patient data privacy and the complexity of medical terminology. Numerous studies have leveraged LLMs for automated health care text classification and compared their performance with traditional machine learning-based methods, which typically require embedding, annotation, and training. However, existing systematic reviews of LLMs either do not specialize in text classification or do not focus specifically on the health care domain.</p><p><strong>Objective: </strong>This research synthesizes and critically evaluates the current evidence in the literature on the use of LLMs for text classification in health care settings.</p><p><strong>Methods: </strong>Major databases (eg, Google Scholar, Scopus, PubMed, ScienceDirect) and other resources were queried for papers published between 2018 and 2024, following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, resulting in 65 eligible research articles. These studies were categorized by text classification type (eg, binary classification, multilabel classification), application (eg, clinical decision support, public health and opinion analysis), methodology, type of health care text, and the metrics used for evaluation and validation.</p><p><strong>Results: </strong>The systematic review includes 65 research articles published between 2020 and Q3 2024, showing a significant increase in publications over time, with 28 papers published in Q1-Q3 2024 alone. Fine-tuning was the most common LLM-based approach (35 papers), followed by prompt engineering (17 papers). BERT (Bidirectional Encoder Representations from Transformers) variants were predominantly used for multilabel classification (50%), whereas closed-source LLMs were most commonly applied to binary (44.0%) and multiclass (30.6%) classification tasks. Clinical decision support was the most frequent application (29 papers). Over 80% of studies used English-language datasets, with clinical notes being the most common text type. 
All studies employed accuracy-related metrics for evaluation, and the findings consistently showed that LLMs outperformed traditional machine learning approaches in health care text classification tasks.</p><p><strong>Conclusions: </strong>This review identifies existing gaps in the literature and highlights future research directions for further investigation.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e79202"},"PeriodicalIF":2.0,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12936667/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146168242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Peer review remains central to ensuring research quality, yet it is constrained by reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.
Objective: This study aimed to address critical gaps in the use of LLMs for peer review of papers in the field of organ transplantation by (1) comparing the performance of 5 recent open-source LLMs; (2) evaluating the impact of author affiliations (prestigious, less prestigious, and none) on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot prompting, few-shot prompting, tree of thoughts (ToT) prompting, and retrieval-augmented generation (RAG), on review decisions.
Methods: A dataset of 200 transplantation papers published between 2024 and 2025 across 4 journal quartiles was evaluated using 5 state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek R1-Distill Qwen, and Qwen 2.5). The 4 prompting techniques (zero-shot prompting, few-shot prompting, ToT prompting, and RAG) were tested under multiple temperature settings. Models were instructed to categorize papers into quartiles. To assess fairness, each paper was evaluated 3 times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource use were recorded. Chi-square tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.
Results: RAG with a temperature of 0.5 achieved the best overall performance (exact match accuracy: 0.35; loose match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to quartile 2 and quartile 3 while avoiding the extreme quartiles (quartile 1 and quartile 4). None of the models demonstrated statistically significant affiliation bias, though Gemma 2 (P=.08) and Qwen 2.5 (P=.054) approached the significance threshold. Each model displayed unique "personalities" in quartile predictions, influencing consistency. Mistral had the highest exact match accuracy (0.35) despite having both the lowest average runtime (1246.378 seconds) and computing resource use (7 billion parameters). While accuracy was insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.
Conclusions: Current open-source LLMs are not reliable enough to replace human peer reviewers. The largely absent affiliation bias suggests potential advantages in fairness, but these benefits do not offset the low decision accuracy. Mistral demonstrated the greatest accuracy and computational efficiency, and RAG with a moderate temperature emerged as the most effective prompting strategy. If LLMs are used to assist in peer review, their outputs require nonnegotiable human supervision to ensure correct judgment and a
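As an illustration of the affiliation-bias analysis, the sketch below runs a chi-square test on an affiliation-by-quartile contingency table and computes adjusted Pearson residuals to localize any deviation. The cell counts are invented, and the residual formula is the standard one rather than anything specific to this paper.

```python
# Affiliation-bias sketch: chi-square test over affiliation x predicted
# quartile, with adjusted Pearson residuals per cell. Counts are invented.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: no affiliation, prestigious, less prestigious; columns: Q1..Q4.
table = np.array([[12, 68, 95, 25],
                  [15, 75, 90, 20],
                  [10, 70, 98, 22]])

chi2, p, dof, expected = chi2_contingency(table)

# Adjusted residual_ij = (O - E) / sqrt(E * (1 - row_share) * (1 - col_share));
# cells with |residual| > ~2 indicate where any bias is concentrated.
n = table.sum()
row_share = table.sum(axis=1, keepdims=True) / n
col_share = table.sum(axis=0, keepdims=True) / n
adj_resid = (table - expected) / np.sqrt(expected * (1 - row_share) * (1 - col_share))

print(f"chi2={chi2:.2f}, p={p:.3f}")
print(np.round(adj_resid, 2))
```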
{"title":"Evaluation of Large Language Models for Peer Review in Transplantation Research: Algorithm Validation Study.","authors":"Selena Ming Shen, Zifu Wang, Krittika Paul, Meng-Hao Li, Xiao Huang, Naoru Koizumi","doi":"10.2196/84322","DOIUrl":"10.2196/84322","url":null,"abstract":"<p><strong>Background: </strong>Peer review remains central to ensuring research quality, yet it is constrained by reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.</p><p><strong>Objective: </strong>This study aimed to address critical gaps in the use of LLMs for peer review of papers in the field of organ transplantation by (1) comparing the performance of 5 recent open-source LLMs; (2) evaluating the impact of author affiliations-prestigious, less prestigious, and none-on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot prompting, few-shot prompting, tree of thoughts (ToT) prompting, and retrieval-augmented generation (RAG), on review decisions.</p><p><strong>Methods: </strong>A dataset of 200 transplantation papers published between 2024 and 2025 across 4 journal quartiles was evaluated using 5 state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek r1-distill Qwen, and Qwen 2.5). The 4 prompting techniques (zero-shot prompting, few-shot prompting, ToT prompting, and RAG) were tested under multiple temperature settings. Models were instructed to categorize papers into quartiles. To assess fairness, each paper was evaluated 3 times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource use were recorded. Chi-square tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.</p><p><strong>Results: </strong>RAG with a temperature of 0.5 achieved the best overall performance (exact match accuracy: 0.35; loose match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to quartile 2 and quartile 3 while avoiding extreme quartiles (quartile 1 and quartile 4). None of the models demonstrated affiliation bias, though Gemma 2 (P=.08) and Qwen 2.5 (P=.054) were substantially biased. Each model displayed unique \"personalities\" in quartile predictions, influencing consistency. Mistral had the highest exact match accuracy (0.35) despite having both the lowest average runtime (1246.378 seconds) and computing resource use (7 billion parameters). While accuracy was insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.</p><p><strong>Conclusions: </strong>Current open-source LLMs are not reliable enough to replace human peer reviewers. The largely absent affiliation bias suggests potential advantages in fairness, but these benefits do not offset the low decision accuracy. Mistral demonstrated the greatest accuracy and computational efficiency, and RAG with a moderate temperature emerged as the most effective prompting strategy. 
If LLMs are used to assist in peer review, their outputs require nonnegotiable human supervision to ensure correct judgment and a","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e84322"},"PeriodicalIF":2.0,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12936655/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146168177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brian Han, Traci Barnes, Charitha D Reddy, Andrew Y Shin
Large language models (LLMs) are increasingly used by patients and families to interpret complex medical documentation, yet most evaluations focus only on clinician-judged accuracy. In this study, 50 pediatric cardiac intensive care unit notes were summarized using GPT-4o mini and reviewed by both physicians and parents, who rated readability, clinical fidelity, and helpfulness. Parents and clinicians diverged notably in their helpfulness ratings, while clinicians offered key insights on clinical accuracy and parents on readability. This study highlights the need for dual-perspective frameworks that balance clinical precision with patient understanding.
{"title":"Evaluating Large Language Model-Generated Clinical Summaries Through a Dual-Perspective Framework: Retrospective Observational Study.","authors":"Brian Han, Traci Barnes, Charitha D Reddy, Andrew Y Shin","doi":"10.2196/85221","DOIUrl":"10.2196/85221","url":null,"abstract":"<p><p>Large language models (LLMs) are increasingly used by patients and families to interpret complex medical documentation, yet most evaluations focus only on clinician-judged accuracy. In this study, 50 pediatric cardiac intensive care unit notes were summarized using GPT-4o mini and reviewed by both physicians and parents, who rated readability, clinical fidelity, and helpfulness. There were important discrepancies between parents and clinicians in the realm of helpfulness, along with important insights by clinicians assessing clinical accuracy and parents assessing readability. This study highlights the need for dual-perspective frameworks that balance clinical precision with patient understanding.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e85221"},"PeriodicalIF":2.0,"publicationDate":"2026-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12933168/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146159603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}