Using routinely available electronic health record data elements to develop and validate a digital divide risk score.
Pub Date: 2025-02-04. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooaf004
Jamie M Faro, Emily Obermiller, Corey Obermiller, Katy E Trinkley, Garth Wright, Rajani S Sadasivam, Kristie L Foley, Sarah L Cutrona, Thomas K Houston
Background: Digital health (patient portals, remote monitoring devices, video visits) is a routine part of health care, though the digital divide may affect access.
Objectives: To test and validate an electronic health record (EHR) screening tool to identify patients at risk of the digital divide.
Materials and methods: We conducted a retrospective EHR data extraction and cross-sectional survey of participants within 1 health care system. We identified 4 potential digital divide markers from the EHR: (1) mobile phone number, (2) email address, (3) active patient portal, and (4) >2 patient portal logins in the last year. We mailed surveys to patients at higher risk (missing all 4 markers), intermediate risk (missing 1-3 markers), or lower risk (missing no markers). Combining EHR and survey data, we summarized the markers into a risk score and evaluated its association with patients' self-reported lack of Internet access. We then assessed the association between the EHR markers and eHealth Literacy Scale survey outcomes.
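A minimal sketch of how such a marker-based risk tier could be derived from EHR fields, in Python; the column names and tier logic below are illustrative assumptions based on the description above, not the authors' code.

```python
import pandas as pd

# The 4 digital divide markers extracted from the EHR (field names are assumptions)
MARKERS = ["has_mobile_phone", "has_email", "portal_active", "portal_logins_gt2"]

def risk_tier(row: pd.Series) -> str:
    """Map marker completeness to the study's three risk tiers."""
    missing = sum(not bool(row[m]) for m in MARKERS)
    if missing == 4:
        return "higher"        # missing all 4 markers
    elif missing >= 1:
        return "intermediate"  # missing 1-3 markers
    return "lower"             # missing no markers

ehr = pd.DataFrame({
    "has_mobile_phone":  [1, 0, 1],
    "has_email":         [1, 0, 0],
    "portal_active":     [1, 0, 1],
    "portal_logins_gt2": [1, 0, 0],
})
ehr["risk_tier"] = ehr.apply(risk_tier, axis=1)
print(ehr["risk_tier"].tolist())  # ['lower', 'higher', 'intermediate']
```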
Results: A total of 249 patients (39.4%) completed the survey (53% aged >65 years, 51% female, 50% minority race, 55% rural/small town residents, 46% private insurance, 45% Medicare). Individually, the 4 EHR markers had high sensitivity (range 81%-95%) and specificity (range 65%-79%) compared with survey responses. The EHR marker-based score (high risk, intermediate risk, low risk) predicted absence of Internet access (receiver operating characteristic c-statistic = 0.77). Mean digital health literacy scores decreased significantly as EHR marker-based digital divide risk increased (P <.001).
Discussion: Each of the 4 EHR markers (mobile phone number, email address, active patient portal, and actively used patient portal), compared with self-report, yielded high levels of sensitivity, specificity, and overall accuracy.
Conclusion: Using these markers, health care systems could target interventions and implementation strategies to support equitable patient access to digital health.
{"title":"Using routinely available electronic health record data elements to develop and validate a digital divide risk score.","authors":"Jamie M Faro, Emily Obermiller, Corey Obermiller, Katy E Trinkley, Garth Wright, Rajani S Sadasivam, Kristie L Foley, Sarah L Cutrona, Thomas K Houston","doi":"10.1093/jamiaopen/ooaf004","DOIUrl":"10.1093/jamiaopen/ooaf004","url":null,"abstract":"<p><strong>Background: </strong>Digital health (patient portals, remote monitoring devices, video visits) is a routine part of health care, though the digital divide may affect access.</p><p><strong>Objectives: </strong>To test and validate an electronic health record (EHR) screening tool to identify patients at risk of the digital divide.</p><p><strong>Materials and methods: </strong>We conducted a retrospective EHR data extraction and cross-sectional survey of participants within 1 health care system. We identified 4 potential digital divide markers from the EHR: (1) mobile phone number, (2) email address, (3) active patient portal, and (4) >2 patient portal logins in the last year. We mailed surveys to patients at higher risk (missing all 4 markers), intermediate risk (missing 1-3 markers), or lower risk (missing no markers). Combining EHR and survey data, we summarized the markers into risk scores and evaluated its association with patients' report of lack of Internet access. Then, we assessed the association of EHR markers and eHealth Literacy Scale survey outcomes.</p><p><strong>Results: </strong>A total of 249 patients (39.4%) completed the survey (53%>65 years, 51% female, 50% minority race, 55% rural/small town residents, 46% private insurance, 45% Medicare). Individually, the 4 EHR markers had high sensitivity (range 81%-95%) and specificity (range 65%-79%) compared with survey responses. The EHR marker-based score (high risk, intermediate risk, low risk) predicted absence of Internet access (receiver operator characteristics <i>c</i>-statistic=0.77). Mean digital health literacy scores significantly decreased as her marker digital divide risk increased (<i>P</i> <.001).</p><p><strong>Discussion: </strong>Each of the four EHR markers (Cell phone, email address, patient portal active, and patient portal actively used) compared with self-report yielded high levels of sensitivity, specificity, and overall accuracy.</p><p><strong>Conclusion: </strong>Using these markers, health care systems could target interventions and implementation strategies to support equitable patient access to digital health.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooaf004"},"PeriodicalIF":2.5,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792649/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143190786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging deep learning to detect stance in Spanish tweets on COVID-19 vaccination.
Pub Date: 2025-01-31. DOI: 10.1093/jamiaopen/ooaf007
Guillermo Blanco, Rubén Yáñez Martínez, Anália Lourenço
Objectives: The automatic detection of stance on social media is an important task for public health applications, especially in the context of health crises. Unfortunately, existing models are typically trained on English corpora. Considering the benefits of extending research to other widely spoken languages, the goal of this study is to develop stance detection models for social media posts in Spanish.
Materials and methods: A corpus of 6170 tweets about COVID-19 vaccination, posted between March 1, 2020 and January 4, 2022, was manually annotated by native speakers. Traditional predictive models were compared with deep learning models to ascertain a baseline performance for the detection of stance in Spanish tweets. The evaluation focused on the ability of multilingual and language-specific embeddings to contextualize the topic of those short texts adequately.
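As a rough illustration of the kind of architecture evaluated (a BiLSTM over multilingual BERT token embeddings), here is a minimal PyTorch sketch; the checkpoint name, hyperparameters, and classification head are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTMStance(nn.Module):
    def __init__(self, encoder="bert-base-multilingual-cased",
                 hidden=128, n_classes=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # against / in favor / no stance

    def forward(self, input_ids, attention_mask):
        # Contextual token embeddings from the multilingual encoder
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        _, (h, _) = self.lstm(tokens)
        fused = torch.cat([h[0], h[1]], dim=1)  # final forward + backward states
        return self.head(fused)

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tok(["Las vacunas salvan vidas"], return_tensors="pt",
            padding=True, truncation=True)
model = BertBiLSTMStance()
print(model(batch["input_ids"], batch["attention_mask"]).shape)  # (1, 3)
```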
Results: The BERT-Multi+BiLSTM combination yielded the best results (macroaveraged F1 and Matthews correlation coefficient scores of 0.86 and 0.79, respectively; interpolated area under the receiver operating characteristic curve [AUC] of 0.95 for tweets against vaccination, 0.85 for tweets in favor of vaccination, and 0.97 for tweets containing no stance information), closely followed by the BETO+BiLSTM and RoBERTa BNE-LSTM Spanish models and the term frequency-inverse document frequency (TF-IDF)+SVM model (average AUC decrease of 0.01). The main differentiating factor among these models was the ability to predict tweets against vaccination.
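For reference, the reported metrics can be computed with scikit-learn as follows; the labels and probabilities below are toy stand-ins, not the study's data.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

y_true = np.array([0, 1, 2, 1, 0, 2])  # 0=against, 1=in favor, 2=no stance
y_pred = np.array([0, 1, 2, 0, 0, 2])
proba = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.7, 0.2],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.4, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.2, 0.7]])

print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("MCC:", matthews_corrcoef(y_true, y_pred))
for cls, name in enumerate(["against", "in favor", "no stance"]):
    # One-vs-rest AUC per class, mirroring the per-stance AUCs reported above
    auc = roc_auc_score((y_true == cls).astype(int), proba[:, cls])
    print(f"AUC ({name}): {auc:.2f}")
```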
Discussion: The BERT-Multi+BiLSTM model outperformed the other models in per-class prediction capacity. The main assumption is that language-specific embeddings do not outperform multilingual embeddings or TF-IDF features because of the context of the topic: the pretraining context of BERT and RoBERTa embeddings is general, so these embeddings are unfamiliar with the slang commonly used on Twitter and, more specifically, during the pandemic.
Conclusion: The best-performing model detects tweet stance with performance high enough to ensure its usefulness for public health applications, namely awareness campaigns, misinformation detection, and other early intervention and prevention actions that seek to improve an individual's well-being based on self-reported experiences and opinions. The dataset and code of the study are available on GitHub.
{"title":"Leveraging deep learning to detect stance in Spanish tweets on COVID-19 vaccination.","authors":"Guillermo Blanco, Rubén Yáñez Martínez, Anália Lourenço","doi":"10.1093/jamiaopen/ooaf007","DOIUrl":"10.1093/jamiaopen/ooaf007","url":null,"abstract":"<p><strong>Objectives: </strong>The automatic detection of stance on social media is an important task for public health applications, especially in the context of health crises. Unfortunately, existing models are typically trained on English corpora. Considering the benefits of extending research to other widely spoken languages, the goal of this study is to develop stance detection models for social media posts in Spanish.</p><p><strong>Materials and methods: </strong>A corpus of 6170 tweets about COVID-19 vaccination, posted between March 1, 2020 and January 4, 2022, was manually annotated by native speakers. Traditional predictive models were compared with deep learning models to ascertain a baseline performance for the detection of stance in Spanish tweets. The evaluation focused on the ability of multilingual and language-specific embeddings to contextualize the topic of those short texts adequately.</p><p><strong>Results: </strong>The BERT-Multi+BiLSTM combination yielded the best results (macroaveraged F1 and Matthews correlation coefficient scores of 0.86 and 0.79, respectively; interpolated area under the receiver operating curve [AUC] of 0.95 for tweets against vaccination and 0.85 in favor of vaccination and a score of 0.97 for tweets containing no stance information), closely followed by the BETO+BiLSTM and RoBERTa BNE-LSTM Spanish models and the term frequency-inverse document frequency+SVM model (average AUC decrease of 0.01). The main differentiating factor among these models was the ability to predict tweets against vaccination.</p><p><strong>Discussion: </strong>The BERT Multi+BILSTM model outperformed the other models in terms of per class prediction capacity. The main assumption is that language-specific embeddings do not outperform multilingual embeddings or TF-IDF features because of the context of the topic. The inherent context of BERT or RoBERTa embeddings is general. So, these embeddings are not familiar with the slang commonly used on Twitter and, more specifically, during the pandemic.</p><p><strong>Conclusion: </strong>The best performing model detects tweet stance with performance high enough to ensure its usefulness for public health applications, namely awareness campaigns, misinformation detection and other early intervention and prevention actions seeking to improve an individual's well-being based on autoreported experiences and opinions. The dataset and code of the study are available on GitHub.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooaf007"},"PeriodicalIF":2.5,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11854073/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143504472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
pyDeid: an improved, fast, flexible, and generalizable rule-based approach for deidentification of free-text medical records.
Pub Date: 2025-01-22. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae152
Vaakesan Sundrelingam, Shireen Parimoo, Frances Pogacar, Radha Koppula, Saeha Shin, Chloe Pou-Prom, Surain B Roberts, Amol A Verma, Fahad Razak
Objectives: Deidentification of personally identifiable information in free-text clinical data is fundamental to making these data broadly available for research. However, there are gaps in the deidentification landscape with regard to the functionality and flexibility of existing tools, as well as suboptimal tradeoffs between deidentification accuracy and speed. To address these gaps and tradeoffs, we developed pyDeid, a new Python-based deidentification tool.
Materials and methods: pyDeid uses a combination of regular expression-based rules, fixed exclusion lists, and inclusion lists to deidentify free-text data. Additional configurations of pyDeid include optional named entity recognition and custom name lists. We measured its deidentification performance and speed on 700 admission notes from a Canadian hospital, the publicly available n2c2 benchmark dataset of American discharge notes, and a synthetic dataset of artificial intelligence (AI)-generated admission notes. We also compared its performance with the PhysioNet De-identification Software and the popular open-source Philter tool.
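To make the rule-based strategy concrete, here is a heavily simplified Python sketch of regex rules plus a custom name (inclusion) list; the patterns and names are illustrative assumptions, not pyDeid's actual rules.

```python
import re

# Simplified regex rules for two identifier types (illustrative only)
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
MRN = re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE)
# A custom name (inclusion) list, analogous to pyDeid's configurable name lists
KNOWN_NAMES = {"Jane Doe", "John Smith"}

def deidentify(note: str) -> str:
    note = PHONE.sub("<PHONE>", note)
    note = MRN.sub("<MRN>", note)
    for name in KNOWN_NAMES:
        note = note.replace(name, "<NAME>")
    return note

print(deidentify("Jane Doe (MRN: 1234567) called from 416-555-0199."))
# -> <NAME> (<MRN>) called from <PHONE>.
```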
Results: Different configurations of pyDeid outperformed other tools on various metrics, with a "best" accuracy value of 0.988, best precision of 0.889, best recall of 0.950, and best F1 score of 0.904. All configurations of pyDeid were significantly faster than Philter and the PhysioNet De-identification Software, with the fastest deidentification speed of 0.48 s per note.
Discussion and conclusions: pyDeid offers the flexibility to trade off performance against speed, and precision against recall, while addressing some of the functionality gaps left by other tools. pyDeid also generalizes to domains outside of clinical data and can be further customized for specific contexts or particular workflows.
{"title":"pyDeid: an improved, fast, flexible, and generalizable rule-based approach for deidentification of free-text medical records.","authors":"Vaakesan Sundrelingam, Shireen Parimoo, Frances Pogacar, Radha Koppula, Saeha Shin, Chloe Pou-Prom, Surain B Roberts, Amol A Verma, Fahad Razak","doi":"10.1093/jamiaopen/ooae152","DOIUrl":"10.1093/jamiaopen/ooae152","url":null,"abstract":"<p><strong>Objectives: </strong>Deidentification of personally identifiable information in free-text clinical data is fundamental to making these data broadly available for research. However, there exist gaps in the deidentification landscape with regard to the functionality and flexibility of extant tools, as well as suboptimal tradeoffs between deidentification accuracy and speed. To address these gaps and tradeoffs, we develop a new Python-based deidentification software, pyDeid.</p><p><strong>Materials and methods: </strong>pyDeid uses a combination of regular expression-based rules, fixed exclusion lists and inclusion lists to deidentify free-text data. Additional configurations of pyDeid include optional named entity recognition and custom name lists. We measure its deidentification performance and speed on 700 admission notes from a Canadian hospital, the publicly available n2c2 benchmark dataset of American discharge notes, as well as a synthetic dataset of artificial intelligence (AI) generated admission notes. We also compare its performance with the Physionet De-identification Software and the popular open-source Philter tool.</p><p><strong>Results: </strong>Different configurations of pyDeid outperformed other tools on various metrics, with a \"best\" accuracy value of 0.988, best precision of 0.889, best recall of 0.950, and best F1 score of 0.904. All configurations of pyDeid were significantly faster than Philter and Physionet De-identification Software, with the fastest deidentification speed of 0.48 s per note.</p><p><strong>Discussion and conclusions: </strong>pyDeid allows the flexibility to prioritize between performance and speed, as well as precision and recall, while addressing some of the gaps in functionality left by other tools. pyDeid is also generalizable to domains outside of clinical data and can be further customized for specific contexts or for particular workflows.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae152"},"PeriodicalIF":2.5,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11752853/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143025020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating dimensionality reduction of comorbidities for predictive modeling in individuals with neurofibromatosis type 1.
Pub Date: 2025-01-22. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae157
Aditi Gupta, Ethan Hillis, Inez Y Oh, Stephanie M Morris, Zach Abrams, Randi E Foraker, David H Gutmann, Philip R O Payne
Objective: Dimensionality reduction techniques aim to enhance the performance of machine learning (ML) models by reducing noise and mitigating overfitting. We sought to compare the effect of different dimensionality reduction methods for comorbidity features extracted from electronic health records (EHRs) on the performance of ML models for predicting the development of various sub-phenotypes in children with neurofibromatosis type 1 (NF1).
Materials and methods: EHR-derived data from pediatric subjects with a confirmed clinical diagnosis of NF1 were used to create 10 unique comorbidity-code-derived feature sets by applying dimensionality reduction techniques to raw International Classification of Diseases (ICD) codes, Clinical Classifications Software Refined (CCSR) categories, and Phecode mapping schemes. We compared the performance of logistic regression, XGBoost, and random forest models utilizing each feature set.
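A sketch of this comparison pattern, evaluating each model on each feature set by cross-validated AUC; the feature sets, dimensions, and labels below are random stand-ins, not the study's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)  # one NF1 sub-phenotype outcome (toy labels)
feature_sets = {
    "raw_icd": rng.random((200, 500)),  # high-dimensional raw ICD indicators
    "ccsr":    rng.random((200, 80)),   # CCSR-aggregated features
    "phecode": rng.random((200, 120)),  # Phecode-aggregated features
}
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf":     RandomForestClassifier(n_estimators=200, random_state=0),
    "xgb":    XGBClassifier(eval_metric="logloss"),
}
for fs_name, X in feature_sets.items():
    for m_name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{fs_name:8s} {m_name:7s} AUC={auc:.2f}")
```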
Results: XGBoost-based predictive models were most successful at predicting NF1 sub-phenotypes. Overall, features based on domain knowledge-informed mapping schema performed better than unsupervised feature reduction methods. High-level features exhibited the worst performance across models and outcomes, suggesting excessive information loss with over-aggregation of features.
Discussion: Model performance is significantly impacted by dimensionality reduction techniques and varies by specific ML algorithm and outcome being predicted. Automated methods using existing knowledge and ontology databases can effectively aggregate features extracted from EHRs.
Conclusion: Dimensionality reduction through feature aggregation can enhance the performance of ML models, particularly in high-dimensional datasets with small sample sizes, as commonly found in EHR-based health applications. However, if not carefully optimized, it can lead to information loss and data oversimplification, potentially adversely affecting model performance.
{"title":"Evaluating dimensionality reduction of comorbidities for predictive modeling in individuals with neurofibromatosis type 1.","authors":"Aditi Gupta, Ethan Hillis, Inez Y Oh, Stephanie M Morris, Zach Abrams, Randi E Foraker, David H Gutmann, Philip R O Payne","doi":"10.1093/jamiaopen/ooae157","DOIUrl":"10.1093/jamiaopen/ooae157","url":null,"abstract":"<p><strong>Objective: </strong>Dimensionality reduction techniques aim to enhance the performance of machine learning (ML) models by reducing noise and mitigating overfitting. We sought to compare the effect of different dimensionality reduction methods for comorbidity features extracted from electronic health records (EHRs) on the performance of ML models for predicting the development of various sub-phenotypes in children with Neurofibromatosis type 1 (NF1).</p><p><strong>Materials and methods: </strong>EHR-derived data from pediatric subjects with a confirmed clinical diagnosis of NF1 were used to create 10 unique comorbidities code-derived feature sets by incorporating dimensionality reduction techniques using raw International Classification of Diseases codes, Clinical Classifications Software Refined, and Phecode mapping schemes. We compared the performance of logistic regression, XGBoost, and random forest models utilizing each feature set.</p><p><strong>Results: </strong>XGBoost-based predictive models were most successful at predicting NF1 sub-phenotypes. Overall, features based on domain knowledge-informed mapping schema performed better than unsupervised feature reduction methods. High-level features exhibited the worst performance across models and outcomes, suggesting excessive information loss with over-aggregation of features.</p><p><strong>Discussion: </strong>Model performance is significantly impacted by dimensionality reduction techniques and varies by specific ML algorithm and outcome being predicted. Automated methods using existing knowledge and ontology databases can effectively aggregate features extracted from EHRs.</p><p><strong>Conclusion: </strong>Dimensionality reduction through feature aggregation can enhance the performance of ML models, particularly in high-dimensional datasets with small sample sizes, commonly found in EHRs health applications. However, if not carefully optimized, it can lead to information loss and data oversimplification, potentially adversely affecting model performance.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae157"},"PeriodicalIF":2.5,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11752863/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143024873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A scoping review of the reporting quality of reviews of commercially and publicly available mobile health apps.
Pub Date: 2025-01-13. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae159
Norina Gasteiger, Gill Norman, Rebecca Grainger, Sabine N van der Veer, Lisa McGarrigle, Debra Jones, Charlotte Eost-Telling, Amy Vercell, Claire R Ford, Syed Mustafa Ali, Kate Law, Qimeng Zhao, Matthew Byerly, Chunhu Shi, Alan Davies, Alex Hall, Dawn Dowding
Objectives: There is no guidance to support the reporting of systematic reviews of mobile health (mHealth) apps (app reviews), so authors attempt to use or modify the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). There is a need for reporting guidance, building on PRISMA where appropriate, tailored to app reviews. The objectives were to describe the reporting quality of published mHealth app reviews, identify the need for a reporting guideline, and develop potential candidate items for it.
Materials and methods: A scoping review was conducted following the Joanna Briggs Institute and Arksey and O'Malley approaches. App reviews were identified in January 2024 from Scopus, CINAHL, AMED, EMBASE, MEDLINE, PsycINFO, and the ACM Digital Library, supplemented by snowballing of reference lists and forward citation searches. Data were extracted into Excel and analyzed using descriptive statistics and content synthesis, with PRISMA items as a framework.
Results: One hundred and seventy-one app reviews were identified, published from 2013 to 2024. Protocols were developed for 11% of the reviews, and only 52% reported the geographical location of the app markets. Few reported the duplicate removal process (12%), device and operating system used (30%), or made clear recommendations for the best-rated apps (18%). Nineteen PRISMA items were not reported by most (>85%) reviews, and 4 were modified by >30% of the reviews. Involvement of patient/public contributors (4%) or other stakeholders (11%) was infrequent. Overall, 34 candidate items and 10 subitems were identified to be considered for a new guideline.
Discussion and conclusion: App reviews were inconsistently reported, and many PRISMA items were not deemed relevant. Consensus work is needed to revise and prioritize the candidate items for a reporting guideline for systematic app reviews.
{"title":"A scoping review of the reporting quality of reviews of commercially and publicly available mobile health apps.","authors":"Norina Gasteiger, Gill Norman, Rebecca Grainger, Sabine N van der Veer, Lisa McGarrigle, Debra Jones, Charlotte Eost-Telling, Amy Vercell, Claire R Ford, Syed Mustafa Ali, Kate Law, Qimeng Zhao, Matthew Byerly, Chunhu Shi, Alan Davies, Alex Hall, Dawn Dowding","doi":"10.1093/jamiaopen/ooae159","DOIUrl":"10.1093/jamiaopen/ooae159","url":null,"abstract":"<p><strong>Objectives: </strong>There is no guidance to support the reporting of systematic reviews of mobile health (mhealth) apps (app reviews), so authors attempt to use/modify the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). There is a need for reporting guidance, building on PRISMA where appropriate, tailored to app reviews. The objectives were to describe the reporting quality of published mHealth app reviews, identify the need for, and develop potential candidate items for a reporting guideline.</p><p><strong>Materials and methods: </strong>A scoping review following the Joanna Briggs Institute and Arksey and O'Malley approaches. App reviews were identified in January 2024 from SCOPUS, CINAHL, AMED, EMBASE, Medline, PsycINFO, ACM Digital Library, snowballing reference lists, and forward citation searches. Data were extracted into Excel and analyzed using descriptive statistics and content synthesis, using PRISMA items as a framework.</p><p><strong>Results: </strong>One hundred and seventy-one app reviews were identified, published from 2013 to 2024. Protocols were developed for 11% of the reviews, and only 52% reported the geographical location of the app markets. Few reported the duplicate removal process (12%), device and operating system used (30%), or made clear recommendations for the best-rated apps (18%). Nineteen PRISMA items were not reported by most (>85%) reviews, and 4 were modified by >30% of the reviews. Involvement of patient/public contributors (4%) or other stakeholders (11%) was infrequent. Overall, 34 candidate items and 10 subitems were identified to be considered for a new guideline.</p><p><strong>Discussion and conclusion: </strong>App reviews were inconsistently reported, and many PRISMA items were not deemed relevant. Consensus work is needed to revise and prioritize the candidate items for a reporting guideline for systematic app reviews.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae159"},"PeriodicalIF":2.5,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11729727/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142984990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability.
Pub Date: 2025-01-10. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae154
Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy Miller, Danielle S Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew M Churpek, Majid Afshar
Objective: To evaluate large language models (LLMs) for pre-test diagnostic probability estimation and compare their uncertainty estimation performance with a traditional machine learning classifier.
Materials and methods: We assessed 2 instruction-tuned LLMs, Mistral-7B-Instruct and Llama3-70B-chat-hf, on predicting binary outcomes for sepsis, arrhythmia, and congestive heart failure (CHF) using electronic health record (EHR) data from 660 patients. Three uncertainty estimation methods-Verbalized Confidence, Token Logits, and LLM Embedding+XGB-were compared against an eXtreme Gradient Boosting (XGB) classifier trained on raw EHR data. Performance metrics included AUROC and the Pearson correlation between predicted probabilities.
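As an illustration of this evaluation setup, the sketch below computes AUROC for each method and the Pearson correlation of each method's probabilities against the XGB baseline; the outcome labels and probability vectors are toy stand-ins, not the study's data.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

y = np.array([0, 1, 1, 0, 1, 0, 1, 0])  # binary outcome (e.g., sepsis)
preds = {
    "xgb_baseline":      np.array([0.2, 0.8, 0.7, 0.3, 0.9, 0.1, 0.6, 0.4]),
    "verbalized_conf":   np.array([0.5, 0.6, 0.5, 0.5, 0.7, 0.4, 0.5, 0.5]),
    "token_logits":      np.array([0.4, 0.7, 0.6, 0.4, 0.8, 0.3, 0.5, 0.5]),
    "llm_embedding_xgb": np.array([0.3, 0.8, 0.6, 0.3, 0.8, 0.2, 0.6, 0.4]),
}
for name, p in preds.items():
    r, _ = pearsonr(p, preds["xgb_baseline"])  # agreement with the baseline
    print(f"{name:18s} AUROC={roc_auc_score(y, p):.2f} r_vs_xgb={r:.2f}")
```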
Results: The XGB classifier outperformed the LLM-based methods across all tasks. LLM Embedding+XGB showed the closest performance to the XGB baseline, while Verbalized Confidence and Token Logits underperformed.
Discussion: These findings, consistent across multiple models and demographic groups, highlight the limitations of current LLMs in providing reliable pre-test probability estimations and underscore the need for improved calibration and bias mitigation strategies. Future work should explore hybrid approaches that integrate LLMs with numerical reasoning modules and calibrated embeddings to enhance diagnostic accuracy and ensure fairer predictions across diverse populations.
Conclusions: LLMs demonstrate potential but currently fall short in estimating diagnostic probabilities compared to traditional machine learning classifiers trained on structured EHR data. Further improvements are needed for reliable clinical use.
{"title":"Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability.","authors":"Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy Miller, Danielle S Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew M Churpek, Majid Afshar","doi":"10.1093/jamiaopen/ooae154","DOIUrl":"10.1093/jamiaopen/ooae154","url":null,"abstract":"<p><strong>Objective: </strong>To evaluate large language models (LLMs) for pre-test diagnostic probability estimation and compare their uncertainty estimation performance with a traditional machine learning classifier.</p><p><strong>Materials and methods: </strong>We assessed 2 instruction-tuned LLMs, Mistral-7B-Instruct and Llama3-70B-chat-hf, on predicting binary outcomes for Sepsis, Arrhythmia, and Congestive Heart Failure (CHF) using electronic health record (EHR) data from 660 patients. Three uncertainty estimation methods-Verbalized Confidence, Token Logits, and LLM Embedding+XGB-were compared against an eXtreme Gradient Boosting (XGB) classifier trained on raw EHR data. Performance metrics included AUROC and Pearson correlation between predicted probabilities.</p><p><strong>Results: </strong>The XGB classifier outperformed the LLM-based methods across all tasks. LLM Embedding+XGB showed the closest performance to the XGB baseline, while Verbalized Confidence and Token Logits underperformed.</p><p><strong>Discussion: </strong>These findings, consistent across multiple models and demographic groups, highlight the limitations of current LLMs in providing reliable pre-test probability estimations and underscore the need for improved calibration and bias mitigation strategies. Future work should explore hybrid approaches that integrate LLMs with numerical reasoning modules and calibrated embeddings to enhance diagnostic accuracy and ensure fairer predictions across diverse populations.</p><p><strong>Conclusions: </strong>LLMs demonstrate potential but currently fall short in estimating diagnostic probabilities compared to traditional machine learning classifiers trained on structured EHR data. Further improvements are needed for reliable clinical use.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae154"},"PeriodicalIF":2.5,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11723528/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142972440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assessing Artificial Intelligence (AI) Implementation for Assisting Gene Linking (at the National Library of Medicine).
Pub Date: 2025-01-07. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae129
Rezarta Islamaj, Chih-Hsuan Wei, Po-Ting Lai, Melanie Huston, Cathleen Coss, Preeti Gokal Kochar, Nicholas Miliaras, James G Mork, Oleg Rodionov, Keiko Sekiya, Dorothy Trinh, Deborah Whitman, Craig Wallin, Zhiyong Lu
Objectives: The National Library of Medicine (NLM) currently indexes close to a million articles each year from more than 5300 medicine and life sciences journals. Of these, a significant number contain critical information about the structure, genetics, and function of genes and proteins in normal and disease states. These articles are identified by NLM curators, who manually link them to the corresponding gene records in the NCBI Gene database, interconnecting the information with all NLM resources and services and bringing considerable value to the life sciences. The NLM aims to provide timely access to all metadata, which requires article indexing to scale to the volume of the published literature. On the other hand, although automatic information extraction methods have been shown to achieve accurate results in biomedical text mining research, it remains difficult to evaluate them on established pipelines and to integrate them into daily workflows.
Materials and methods: Here, we demonstrate how our machine learning model, GNorm2, which achieves state-of-the-art performance in identifying genes and their corresponding species while handling innate textual ambiguities, could be integrated into the established daily workflow at the NLM and evaluated for its performance in this new environment.
Results: We worked with 8 expert biomedical curators and evaluated the integration on these parameters: (1) gene identification accuracy, (2) interannotator agreement with and without GNorm2, (3) potential GNorm2 bias, and (4) indexing consistency and efficiency. We identified key interface changes that significantly helped the curators maximize the benefit of GNorm2 and, based on the biocurator expert survey, further improved the GNorm2 algorithm to cover genes from 135 species, including viral and bacterial genes.
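Interannotator agreement of this kind is commonly quantified with Cohen's kappa; a minimal sketch follows (the paper does not state which agreement statistic was used, and the annotations below are invented).

```python
from sklearn.metrics import cohen_kappa_score

# Gene links assigned by two curators to the same 6 articles (toy data)
curator_a = ["BRCA1", "TP53", "none", "EGFR", "TP53", "KRAS"]
curator_b = ["BRCA1", "TP53", "EGFR", "EGFR", "none", "KRAS"]
# Kappa corrects raw agreement for the agreement expected by chance
print("Cohen's kappa:", round(cohen_kappa_score(curator_a, curator_b), 2))
```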
Conclusion: GNorm2 is currently being fully integrated into the curators' regular workflow.
{"title":"Assessing Artificial Intelligence (AI) Implementation for Assisting Gene Linking (at the National Library of Medicine).","authors":"Rezarta Islamaj, Chih-Hsuan Wei, Po-Ting Lai, Melanie Huston, Cathleen Coss, Preeti Gokal Kochar, Nicholas Miliaras, James G Mork, Oleg Rodionov, Keiko Sekiya, Dorothy Trinh, Deborah Whitman, Craig Wallin, Zhiyong Lu","doi":"10.1093/jamiaopen/ooae129","DOIUrl":"10.1093/jamiaopen/ooae129","url":null,"abstract":"<p><strong>Objectives: </strong>The National Library of Medicine (NLM) currently indexes close to a million articles each year pertaining to more than 5300 medicine and life sciences journals. Of these, a significant number of articles contain critical information about the structure, genetics, and function of genes and proteins in normal and disease states. These articles are identified by the NLM curators, and a manual link is created between these articles and the corresponding gene records at the NCBI Gene database. Thus, the information is interconnected with all the NLM resources, services which bring considerable value to life sciences. National Library of Medicine aims to provide timely access to all metadata, and this necessitates that the article indexing scales to the volume of the published literature. On the other hand, although automatic information extraction methods have been shown to achieve accurate results in biomedical text mining research, it remains difficult to evaluate them on established pipelines and integrate them within the daily workflows.</p><p><strong>Materials and methods: </strong>Here, we demonstrate how our machine learning model, GNorm2, which achieved state-of-the art performance on identifying genes and their corresponding species at the same time handling innate textual ambiguities, could be integrated with the established daily workflow at the NLM and evaluated for its performance in this new environment.</p><p><strong>Results: </strong>We worked with 8 biomedical curator experts and evaluated the integration using these parameters: (1) gene identification accuracy, (2) interannotator agreement with and without GNorm2, (3) GNorm2 potential bias, and (4) indexing consistency and efficiency. We identified key interface changes that significantly helped the curators to maximize the GNorm2 benefit, and further improved the GNorm2 algorithm to cover 135 species of genes including viral and bacterial genes, based on the biocurator expert survey.</p><p><strong>Conclusion: </strong>GNorm2 is currently in the process of being fully integrated into the regular curator's workflow.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae129"},"PeriodicalIF":2.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11706533/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Development and validation of computable social phenotypes for health-related social needs.
Pub Date: 2025-01-07. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae150
Megan E Gregory, Suranga N Kasthurirathne, Tanja Magoc, Cassidy McNamee, Christopher A Harle, Joshua R Vest
Objective: Measurement of health-related social needs (HRSNs) is complex. We sought to develop and validate computable phenotypes (CPs) using structured electronic health record (EHR) data for food insecurity, housing instability, financial insecurity, transportation barriers, and a composite-type measure of these, using human-defined rule-based and machine learning (ML) classifier approaches.
Materials and methods: We collected HRSN surveys as the reference standard and obtained EHR data from 1550 patients in 3 health systems across 2 states. We followed a Delphi-like approach to develop the human-defined rule-based CPs. For the ML classifier approach, we trained supervised ML (XGBoost) models using 78 features. Using the surveys as the reference standard, we calculated sensitivity, specificity, positive predictive values, and area under the curve (AUC). We compared AUCs using the DeLong test and other performance measures using McNemar's test, and checked for differential performance.
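A sketch of the reported validation metrics (sensitivity, specificity, PPV, AUC) computed from a confusion matrix against the survey reference standard; the data below are toy stand-ins, not the study's.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

survey = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # reference standard HRSN (survey)
cp_prob = np.array([0.7, 0.4, 0.6, 0.3, 0.2, 0.5, 0.8, 0.1])  # CP output
cp_pred = (cp_prob >= 0.5).astype(int)        # illustrative 0.5 threshold

tn, fp, fn, tp = confusion_matrix(survey, cp_pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("PPV:        ", tp / (tp + fp))
print("AUC:        ", roc_auc_score(survey, cp_prob))
```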
Results: Most patients (63%) reported at least one HRSN on the reference standard survey. Human-defined rule-based CPs exhibited poor performance (AUCs = 0.52 to 0.68). ML classifier CPs performed significantly better, but still only poor to fair (AUCs = 0.68 to 0.75). Significant differences by race/ethnicity were found for the ML classifier CPs (higher AUCs for White non-Hispanic patients). Important features included the number of encounters and Medicaid insurance.
Discussion: Using a supervised ML classifier approach, HRSN CPs approached thresholds of fair performance, but exhibited differential performance by race/ethnicity.
Conclusion: CPs may help to identify patients who may benefit from additional social needs screening. Future work should explore the use of area-level features via geospatial data and natural language processing to improve model performance.
{"title":"Development and validation of computable social phenotypes for health-related social needs.","authors":"Megan E Gregory, Suranga N Kasthurirathne, Tanja Magoc, Cassidy McNamee, Christopher A Harle, Joshua R Vest","doi":"10.1093/jamiaopen/ooae150","DOIUrl":"10.1093/jamiaopen/ooae150","url":null,"abstract":"<p><strong>Objective: </strong>Measurement of health-related social needs (HRSNs) is complex. We sought to develop and validate computable phenotypes (CPs) using structured electronic health record (EHR) data for food insecurity, housing instability, financial insecurity, transportation barriers, and a composite-type measure of these, using human-defined rule-based and machine learning (ML) classifier approaches.</p><p><strong>Materials and methods: </strong>We collected HRSN surveys as the reference standard and obtained EHR data from 1550 patients in 3 health systems from 2 states. We followed a Delphi-like approach to develop the human-defined rule-based CP. For the ML classifier approach, we trained supervised ML (XGBoost) models using 78 features. Using surveys as the reference standard, we calculated sensitivity, specificity, positive predictive values, and area under the curve (AUC). We compared AUCs using the Delong test and other performance measures using McNemar's test, and checked for differential performance.</p><p><strong>Results: </strong>Most patients (63%) reported at least one HRSN on the reference standard survey. Human-defined rule-based CPs exhibited poor performance (AUCs=.52 to .68). ML classifier CPs performed significantly better, but still poor-to-fair (AUCs = .68 to .75). Significant differences for race/ethnicity were found for ML classifier CPs (higher AUCs for White non-Hispanic patients). Important features included number of encounters and Medicaid insurance.</p><p><strong>Discussion: </strong>Using a supervised ML classifier approach, HRSN CPs approached thresholds of fair performance, but exhibited differential performance by race/ethnicity.</p><p><strong>Conclusion: </strong>CPs may help to identify patients who may benefit from additional social needs screening. Future work should explore the use of area-level features via geospatial data and natural language processing to improve model performance.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae150"},"PeriodicalIF":2.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11706536/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal prediction of extracorporeal support-a resource intensive therapy, utilizing a large national database.
Pub Date: 2025-01-06. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae158
Daoyi Zhu, Bing Xue, Neel Shah, Philip Richard Orrin Payne, Chenyang Lu, Ahmed Sameh Said
Objective: Extracorporeal membrane oxygenation (ECMO) is among the most resource-intensive therapies in critical care. The COVID-19 pandemic highlighted the lack of ECMO resource allocation tools. We aimed to develop a continuous ECMO risk prediction model to enhance patient triage and resource allocation.
Materials and methods: We leveraged multimodal data from the National COVID Cohort Collaborative (N3C) to develop a hierarchical deep learning model, labeled "PreEMPT-ECMO" (Prediction, Early Monitoring, and Proactive Triage for ECMO), which integrates static and multi-granularity time series features to generate continuous predictions of ECMO utilization. Model performance was assessed across time points ranging from 0 to 96 hours prior to ECMO initiation, using both accuracy and precision metrics.
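To make the fusion of static and time series inputs concrete, here is a minimal PyTorch sketch of a model that encodes a multivariate time series with a GRU and concatenates it with static features before a sigmoid output; the layers, sizes, and names are illustrative assumptions, not the PreEMPT-ECMO architecture.

```python
import torch
import torch.nn as nn

class StaticPlusTimeSeries(nn.Module):
    def __init__(self, n_static=10, n_ts=20, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_ts, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + n_static, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, static, series):
        _, h = self.gru(series)                 # encode vitals/labs over time
        fused = torch.cat([h[-1], static], dim=1)
        return torch.sigmoid(self.head(fused))  # P(ECMO) at this time point

model = StaticPlusTimeSeries()
static = torch.randn(4, 10)      # e.g., demographics, comorbidities
series = torch.randn(4, 96, 20)  # e.g., hourly measurements over 96 h
print(model(static, series).shape)  # torch.Size([4, 1])
```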
Results: Between January 2020 and May 2023, 101 400 patients were included, with 1298 (1.28%) supported on ECMO. PreEMPT-ECMO outperformed established predictive models, including Logistic Regression, Support Vector Machine, Random Forest, and Extreme Gradient Boosting Tree, in both accuracy and precision at all time points. Model interpretation analysis also highlighted variations in feature contributions through each patient's clinical course.
Discussion and conclusions: We developed a hierarchical model for continuous ECMO use prediction, utilizing a large multicenter dataset incorporating both static and time series variables of various granularities. This novel approach reflects the nuanced decision-making process inherent in ECMO initiation and has the potential to be used as an early alert tool to guide patient triage and ECMO resource allocation. Future directions include prospective validation and generalizability on non-COVID-19 refractory respiratory failure, aiming to improve patient outcomes.
{"title":"Multi-modal prediction of extracorporeal support-a resource intensive therapy, utilizing a large national database.","authors":"Daoyi Zhu, Bing Xue, Neel Shah, Philip Richard Orrin Payne, Chenyang Lu, Ahmed Sameh Said","doi":"10.1093/jamiaopen/ooae158","DOIUrl":"10.1093/jamiaopen/ooae158","url":null,"abstract":"<p><strong>Objective: </strong>Extracorporeal membrane oxygenation (ECMO) is among the most resource-intensive therapies in critical care. The COVID-19 pandemic highlighted the lack of ECMO resource allocation tools. We aimed to develop a continuous ECMO risk prediction model to enhance patient triage and resource allocation.</p><p><strong>Material and methods: </strong>We leveraged multimodal data from the National COVID Cohort Collaborative (N3C) to develop a hierarchical deep learning model, labeled \"PreEMPT-ECMO\" (Prediction, Early Monitoring, and Proactive Triage for ECMO) which integrates static and multi-granularity time series features to generate continuous predictions of ECMO utilization. Model performance was assessed across time points ranging from 0 to 96 hours prior to ECMO initiation, using both accuracy and precision metrics.</p><p><strong>Results: </strong>Between January 2020 and May 2023, 101 400 patients were included, with 1298 (1.28%) supported on ECMO. PreEMPT-ECMO outperformed established predictive models, including Logistic Regression, Support Vector Machine, Random Forest, and Extreme Gradient Boosting Tree, in both accuracy and precision at all time points. Model interpretation analysis also highlighted variations in feature contributions through each patient's clinical course.</p><p><strong>Discussion and conclusions: </strong>We developed a hierarchical model for continuous ECMO use prediction, utilizing a large multicenter dataset incorporating both static and time series variables of various granularities. This novel approach reflects the nuanced decision-making process inherent in ECMO initiation and has the potential to be used as an early alert tool to guide patient triage and ECMO resource allocation. Future directions include prospective validation and generalizability on non-COVID-19 refractory respiratory failure, aiming to improve patient outcomes.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae158"},"PeriodicalIF":2.5,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11702361/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143024995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Trustworthiness of a machine learning early warning model in medical and surgical inpatients.
Pub Date: 2025-01-06. eCollection Date: 2025-02-01. DOI: 10.1093/jamiaopen/ooae156
Pedro J Caraballo, Anne M Meehan, Karen M Fischer, Parvez Rahman, Gyorgy J Simon, Genevieve B Melton, Hojjat Salehinejad, Bijan J Borah
Objectives: In general hospital wards, machine learning (ML)-based early warning systems (EWSs) can identify patients at risk of deterioration and facilitate rescue interventions. We assessed the subpopulation performance of an ML-based EWS on medical and surgical adult patients admitted to general hospital wards.
Materials and methods: We assessed the scores of an EWS integrated into the electronic health record and calculated every 15 minutes to predict a composite adverse event (AE): all-cause mortality, transfer to intensive care, cardiac arrest, or rapid response team evaluation. The distributions of the First Score 3 hours after admission, the Highest Score at any time during the hospitalization, and the Last Score just before an AE or dismissal without an AE were calculated. The Last Score was used to calculate the area under the receiver operating characteristic curve (ROC-AUC) and the precision-recall curve (PRC-AUC).
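A sketch of deriving the First, Highest, and Last Scores per admission from 15-minute EWS scores and evaluating the Last Score against the composite AE; column names and data are illustrative (the study anchored the First Score at 3 hours after admission, simplified here to the first recorded score), and average_precision_score is used as a standard PRC-AUC estimator.

```python
import pandas as pd
from sklearn.metrics import average_precision_score, roc_auc_score

scores = pd.DataFrame({
    "admission_id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(
        ["2021-09-01 08:00", "2021-09-01 08:15", "2021-09-01 08:30",
         "2021-09-02 10:00", "2021-09-02 10:15", "2021-09-02 10:30"]),
    "ews": [0.10, 0.30, 0.70, 0.20, 0.15, 0.10],
})
outcome = pd.Series({1: 1, 2: 0})  # composite AE per admission (toy labels)

g = scores.sort_values("timestamp").groupby("admission_id")["ews"]
summary = pd.DataFrame({"first": g.first(), "highest": g.max(), "last": g.last()})

print("ROC-AUC:", roc_auc_score(outcome, summary["last"]))
print("PRC-AUC:", average_precision_score(outcome, summary["last"]))
```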
Results: From August 23, 2021 to March 31, 2022, 35 937 medical admissions had 2173 (6.05%) AEs compared to 25 214 surgical admissions with 4984 (19.77%) AEs. Medical and surgical admissions had significantly different (P <.001) distributions of the First Score, Highest Score, and Last Score among those with and without an AE. The model performed better in the medical group than in the surgical group (ROC-AUC 0.869 vs 0.677; PRC-AUC 0.988 vs 0.878).
Discussion: Heterogeneity of medical and surgical patients can significantly impact the performance of an ML-based EWS, changing the model's validity and clinical discernment.
Conclusions: Characterization of the target patient subpopulations has clinical implications and should be considered when developing models to be used in general hospital wards.
{"title":"Trustworthiness of a machine learning early warning model in medical and surgical inpatients.","authors":"Pedro J Caraballo, Anne M Meehan, Karen M Fischer, Parvez Rahman, Gyorgy J Simon, Genevieve B Melton, Hojjat Salehinejad, Bijan J Borah","doi":"10.1093/jamiaopen/ooae156","DOIUrl":"10.1093/jamiaopen/ooae156","url":null,"abstract":"<p><strong>Objectives: </strong>In the general hospital wards, machine learning (ML)-based early warning systems (EWSs) can identify patients at risk of deterioration to facilitate rescue interventions. We assess subpopulation performance of a ML-based EWS on medical and surgical adult patients admitted to general hospital wards.</p><p><strong>Materials and methods: </strong>We assessed the scores of an EWS integrated into the electronic health record and calculated every 15 minutes to predict a composite adverse event (AE): all-cause mortality, transfer to intensive care, cardiac arrest, or rapid response team evaluation. The distributions of the First Score 3 hours after admission, the Highest Score at any time during the hospitalization, and the Last Score just before an AE or dismissal without an AE were calculated. The Last Score was used to calculate the area under the receiver operating characteristic curve (ROC-AUC) and the precision-recall curve (PRC-AUC).</p><p><strong>Results: </strong>From August 23, 2021 to March 31, 2022, 35 937 medical admissions had 2173 (6.05%) AE compared to 25 214 surgical admissions with 4984 (19.77%) AE. Medical and surgical admissions had significant different (<i>P</i> <.001) distributions of the First Score, Highest Score, and Last Score among those with an AE and without an AE. The model performed better in the medical group when compared to the surgical group, ROC-AUC 0.869 versus 0.677, and RPC-AUC 0.988 versus 0.878, respectively.</p><p><strong>Discussion: </strong>Heterogeneity of medical and surgical patients can significantly impact the performance of a ML-based EWS, changing the model validity and clinical discernment.</p><p><strong>Conclusions: </strong>Characterization of the target patient subpopulations has clinical implications and should be considered when developing models to be used in general hospital wards.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae156"},"PeriodicalIF":2.5,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11702360/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143025075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}