Objective: To develop an advanced multi-task large language model (LLM) framework for extracting diverse types of information about dietary supplements (DSs) from clinical records.
Methods: We focused on 4 core DS information extraction tasks: named entity recognition (2 949 clinical sentences), relation extraction (4 892 sentences), triple extraction (2 949 sentences), and usage classification (2 460 sentences). To address these tasks, we introduced the retrieval-augmented multi-task information extraction (RAMIE) framework, which incorporates: (1) instruction fine-tuning with task-specific prompts; (2) multi-task training of LLMs to enhance storage efficiency and reduce training costs; and (3) retrieval-augmented generation, which retrieves similar examples from the training set to improve task performance. We compared the performance of RAMIE to LLMs with instruction fine-tuning alone and conducted an ablation study to evaluate the individual contributions of multi-task learning and retrieval-augmented generation to overall performance improvements.
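To make the retrieval-augmented generation step concrete, the sketch below retrieves the training sentences most similar to a query and packs them, with their annotations, into a task-specific prompt. This is a minimal sketch only: the TF-IDF retriever, the example data, and the prompt layout are assumptions for illustration, not the retriever or prompt format used by RAMIE.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical training pool of (sentence, gold annotation) pairs for one task.
train_pool = [
    ("Patient takes fish oil for high cholesterol.", "supplement: fish oil"),
    ("She started ginkgo biloba last month.", "supplement: ginkgo biloba"),
    ("Denies use of any herbal supplements.", "supplement: none"),
]

sentences = [s for s, _ in train_pool]
vectorizer = TfidfVectorizer().fit(sentences)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(
    vectorizer.transform(sentences))

def build_prompt(instruction, query_sentence):
    """Retrieve the most similar training examples and prepend them to the
    task-specific instruction before querying the fine-tuned LLM."""
    _, idx = index.kneighbors(vectorizer.transform([query_sentence]))
    demos = "\n".join(
        f"Input: {train_pool[i][0]}\nOutput: {train_pool[i][1]}" for i in idx[0])
    return f"{instruction}\n{demos}\nInput: {query_sentence}\nOutput:"

print(build_prompt("Extract dietary supplement entities.",
                   "He reports taking melatonin nightly."))
```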
Results: Using the RAMIE framework, Llama2-13B achieved an F1 score of 87.39 on the named entity recognition task, reflecting a 3.51% improvement. It also excelled in the relation extraction task with an F1 score of 93.74, a 1.15% improvement. For the triple extraction task, Llama2-7B achieved an F1 score of 79.45, representing a significant 14.26% improvement. MedAlpaca-7B delivered the highest F1 score of 93.45 on the usage classification task, with a 0.94% improvement. The ablation study highlighted that while multi-task learning improved efficiency with a minor trade-off in performance, the inclusion of retrieval-augmented generation significantly enhanced overall accuracy across tasks.
Conclusion: The RAMIE framework demonstrates substantial improvements in multi-task information extraction for DS-related data from clinical records.
{"title":"RAMIE: retrieval-augmented multi-task information extraction with large language models on dietary supplements.","authors":"Zaifu Zhan, Shuang Zhou, Mingchen Li, Rui Zhang","doi":"10.1093/jamia/ocaf002","DOIUrl":"https://doi.org/10.1093/jamia/ocaf002","url":null,"abstract":"<p><strong>Objective: </strong>To develop an advanced multi-task large language model (LLM) framework for extracting diverse types of information about dietary supplements (DSs) from clinical records.</p><p><strong>Methods: </strong>We focused on 4 core DS information extraction tasks: named entity recognition (2 949 clinical sentences), relation extraction (4 892 sentences), triple extraction (2 949 sentences), and usage classification (2 460 sentences). To address these tasks, we introduced the retrieval-augmented multi-task information extraction (RAMIE) framework, which incorporates: (1) instruction fine-tuning with task-specific prompts; (2) multi-task training of LLMs to enhance storage efficiency and reduce training costs; and (3) retrieval-augmented generation, which retrieves similar examples from the training set to improve task performance. We compared the performance of RAMIE to LLMs with instruction fine-tuning alone and conducted an ablation study to evaluate the individual contributions of multi-task learning and retrieval-augmented generation to overall performance improvements.</p><p><strong>Results: </strong>Using the RAMIE framework, Llama2-13B achieved an F1 score of 87.39 on the named entity recognition task, reflecting a 3.51% improvement. It also excelled in the relation extraction task with an F1 score of 93.74, a 1.15% improvement. For the triple extraction task, Llama2-7B achieved an F1 score of 79.45, representing a significant 14.26% improvement. MedAlpaca-7B delivered the highest F1 score of 93.45 on the usage classification task, with a 0.94% improvement. The ablation study highlighted that while multi-task learning improved efficiency with a minor trade-off in performance, the inclusion of retrieval-augmented generation significantly enhanced overall accuracy across tasks.</p><p><strong>Conclusion: </strong>The RAMIE framework demonstrates substantial improvements in multi-task information extraction for DS-related data from clinical records.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142973004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Susan H Fenton, Cassandra Ciminello, Vickie M Mays, Mary H Stanfill, Valerie Watzlaf
Objective: The ICD-10-CM classification system contains more specificity than its predecessor ICD-9-CM. A stated reason for transitioning to ICD-10-CM was to increase the availability of detailed data. This study aims to determine whether the increased specificity contained in ICD-10-CM is utilized in the ambulatory care setting and inform an evidence-based approach to evaluate ICD-11 content for implementation planning in the United States.
Materials and methods: Diagnosis codes and text descriptions were extracted from a 25% random sample of the IQVIA Ambulatory EMR-US database for 2014 (ICD-9-CM, n = 14 327 155) and 2019 (ICD-10-CM, n = 13 062 900). Code utilization data were analyzed for the total number of codes and the number of unique codes used. Frequencies and tests of significance determined the percentage of available codes utilized and the unspecified code rates for both code sets in each year.
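The two headline measures, the share of available codes actually used and the unspecified-code rate, reduce to simple counts over the extracted codes and descriptions. A minimal pandas sketch follows; the example rows, the 70 000 figure for the size of the ICD-10-CM code set, and the use of the word "unspecified" in descriptions to flag unspecified codes are illustrative assumptions, not the study's exact procedure.

```python
import pandas as pd

# Hypothetical extract of assigned diagnosis codes with their text descriptions.
codes = pd.DataFrame({
    "code": ["I10", "E11.9", "R51.9", "I10"],
    "description": ["Essential (primary) hypertension",
                    "Type 2 diabetes mellitus without complications",
                    "Headache, unspecified",
                    "Essential (primary) hypertension"],
})

total_available = 70_000  # assumed size of the full ICD-10-CM code set
pct_available_used = 100 * codes["code"].nunique() / total_available

# Unspecified-code rate among all assigned codes, flagged via the description.
unspecified_rate = 100 * codes["description"].str.contains(
    "unspecified", case=False).mean()
print(f"{pct_available_used:.4f}% of available codes used; "
      f"{unspecified_rate:.1f}% of assigned codes unspecified")
```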
Results: Only 44.6% of available ICD-10-CM codes were used, compared to 91.5% of available ICD-9-CM codes. Of the total codes used, 14.5% of ICD-9-CM codes were unspecified, while 33.3% of ICD-10-CM codes were unspecified.
Discussion: Even though greater detail is available, the use of unspecified codes increased by 108.5% with ICD-10-CM. The utilization data analyzed in this study do not support a rationale for the large increase in the number of codes in ICD-10-CM. New technologies and methods are likely needed to fully utilize detailed classification systems.
Conclusion: These results help evaluate the content needed in the United States national ICD standard. This analysis of codes in the current ICD standard is important for ICD-11 evaluation, implementation, and use.
{"title":"An examination of ambulatory care code specificity utilization in ICD-10-CM compared to ICD-9-CM: implications for ICD-11 implementation.","authors":"Susan H Fenton, Cassandra Ciminello, Vickie M Mays, Mary H Stanfill, Valerie Watzlaf","doi":"10.1093/jamia/ocaf003","DOIUrl":"https://doi.org/10.1093/jamia/ocaf003","url":null,"abstract":"<p><strong>Objective: </strong>The ICD-10-CM classification system contains more specificity than its predecessor ICD-9-CM. A stated reason for transitioning to ICD-10-CM was to increase the availability of detailed data. This study aims to determine whether the increased specificity contained in ICD-10-CM is utilized in the ambulatory care setting and inform an evidence-based approach to evaluate ICD-11 content for implementation planning in the United States.</p><p><strong>Materials and methods: </strong>Diagnosis codes and text descriptions were extracted from a 25% random sample of the IQVIA Ambulatory EMR-US database for 2014 (ICD-9-CM, n = 14 327 155) and 2019 (ICD-10-CM, n = 13 062 900). Code utilization data was analyzed for the total and unique number of codes. Frequencies and tests of significance determined the percentage of available codes utilized and the unspecified code rates for both code sets in each year.</p><p><strong>Results: </strong>Only 44.6% of available ICD-10-CM codes were used compared to 91.5% of available ICD-9-CM codes. Of the total codes used, 14.5% ICD-9-CM codes were unspecified, while 33.3% ICD-10-CM codes were unspecified.</p><p><strong>Discussion: </strong>Even though greater detail is available, a 108.5% increase in using unspecified codes with ICD-10-CM was found. The utilization data analyzed in this study does not support a rationale for the large increase in the number of codes in ICD-10-CM. New technologies and methods are likely needed to fully utilize detailed classification systems.</p><p><strong>Conclusion: </strong>These results help evaluate the content needed in the United States national ICD standard. This analysis of codes in the current ICD standard is important for ICD-11 evaluation, implementation, and use.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142973003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mitchell M Conover, Patrick B Ryan, Yong Chen, Marc A Suchard, George Hripcsak, Martijn J Schuemie
Objective: Propose a framework to empirically evaluate and report validity of findings from observational studies using pre-specified objective diagnostics, increasing trust in real-world evidence (RWE).
Materials and methods: The framework employs objective diagnostic measures to assess the appropriateness of study designs, analytic assumptions, and threats to validity in generating reliable evidence addressing causal questions. Diagnostic evaluations should be interpreted before study results are unblinded or, alternatively, only results from analyses that pass pre-specified thresholds should be unblinded. We provide a conceptual overview of objective diagnostic measures and demonstrate their impact on the validity of RWE from a large-scale comparative new-user study of various antihypertensive medications. We evaluated expected absolute systematic error (EASE) before and after applying diagnostic thresholds, using a large set of negative control outcomes.
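To make the EASE diagnostic concrete, the sketch below estimates expected absolute systematic error from negative control effect estimates and applies a pre-specified threshold before unblinding. It is a simplified sketch: it treats the spread of the negative-control log estimates as the systematic error distribution and ignores per-estimate sampling error, which the published framework models explicitly, and the threshold value is an assumption chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

def expected_absolute_systematic_error(nc_log_estimates):
    """Simplified EASE: fit a normal distribution to negative-control log effect
    estimates and return the expected absolute bias, E[|bias|]."""
    mu = np.mean(nc_log_estimates)
    sigma = np.std(nc_log_estimates, ddof=1)
    # Mean of the folded normal distribution |N(mu, sigma^2)|.
    return (sigma * np.sqrt(2 / np.pi) * np.exp(-mu**2 / (2 * sigma**2))
            + mu * (1 - 2 * norm.cdf(-mu / sigma)))

# Toy negative-control effect estimates (true effect = 1 for all of them).
nc_log_estimates = np.log([1.05, 0.92, 1.10, 0.98, 1.03, 0.95])
ease = expected_absolute_systematic_error(nc_log_estimates)
unblind_results = ease < 0.25  # pre-specified threshold; value is illustrative
```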
Results: Applying objective diagnostics reduces bias and improves evidence reliability in observational studies. Among 11 716 analyses (EASE = 0.38), 13.9% met pre-specified diagnostic thresholds, which reduced EASE to zero. Objective diagnostics provide a comprehensive and empirical set of tests that increase confidence when passed and raise doubts when failed.
Discussion: The increasing use of real-world data presents a scientific opportunity; however, the complexity of the evidence generation process poses challenges for understanding study validity and trusting RWE. Deploying objective diagnostics is crucial to reducing bias and improving reliability in RWE generation. Under ideal conditions, multiple study designs pass diagnostics and generate consistent results, deepening understanding of causal relationships. Open-source, standardized programs can facilitate implementation of diagnostic analyses.
Conclusion: Objective diagnostics are a valuable addition to the RWE generation process.
{"title":"Objective study validity diagnostics: a framework requiring pre-specified, empirical verification to increase trust in the reliability of real-world evidence.","authors":"Mitchell M Conover, Patrick B Ryan, Yong Chen, Marc A Suchard, George Hripcsak, Martijn J Schuemie","doi":"10.1093/jamia/ocae317","DOIUrl":"https://doi.org/10.1093/jamia/ocae317","url":null,"abstract":"<p><strong>Objective: </strong>Propose a framework to empirically evaluate and report validity of findings from observational studies using pre-specified objective diagnostics, increasing trust in real-world evidence (RWE).</p><p><strong>Materials and methods: </strong>The framework employs objective diagnostic measures to assess the appropriateness of study designs, analytic assumptions, and threats to validity in generating reliable evidence addressing causal questions. Diagnostic evaluations should be interpreted before the unblinding of study results or, alternatively, only unblind results from analyses that pass pre-specified thresholds. We provide a conceptual overview of objective diagnostic measures and demonstrate their impact on the validity of RWE from a large-scale comparative new-user study of various antihypertensive medications. We evaluated expected absolute systematic error (EASE) before and after applying diagnostic thresholds, using a large set of negative control outcomes.</p><p><strong>Results: </strong>Applying objective diagnostics reduces bias and improves evidence reliability in observational studies. Among 11 716 analyses (EASE = 0.38), 13.9% met pre-specified diagnostic thresholds which reduced EASE to zero. Objective diagnostics provide a comprehensive and empirical set of tests that increase confidence when passed and raise doubts when failed.</p><p><strong>Discussion: </strong>The increasing use of real-world data presents a scientific opportunity; however, the complexity of the evidence generation process poses challenges for understanding study validity and trusting RWE. Deploying objective diagnostics is crucial to reducing bias and improving reliability in RWE generation. Under ideal conditions, multiple study designs pass diagnostics and generate consistent results, deepening understanding of causal relationships. Open-source, standardized programs can facilitate implementation of diagnostic analyses.</p><p><strong>Conclusion: </strong>Objective diagnostics are a valuable addition to the RWE generation process.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142957972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dilruk Perera, Siqi Liu, Kay Choong See, Mengling Feng
Objectives: This study introduces Smart Imitator (SI), a 2-phase reinforcement learning (RL) solution enhancing personalized treatment policies in healthcare, addressing challenges from imperfect clinician data and complex environments.
Materials and methods: Smart Imitator's first phase uses adversarial cooperative imitation learning with a novel sample selection schema to categorize clinician policies from optimal to nonoptimal. The second phase creates a parameterized reward function to guide the learning of superior treatment policies through RL. Smart Imitator's effectiveness was validated on 2 datasets: a sepsis dataset with 19 711 patient trajectories and a diabetes dataset with 7234 trajectories.
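The second phase's parameterized reward can be pictured as a learned trade-off between agreeing with the clinician actions categorized as near-optimal in phase 1 and directly rewarding good outcomes. The toy sketch below shows one such blended reward driving a tabular Q-learning update; the reward form, weights, and state/action spaces are hypothetical stand-ins for the paper's adversarial cooperative imitation learning and RL machinery.

```python
import numpy as np

def blended_reward(action, clinician_action, outcome_signal, alpha=0.5):
    """Illustrative parameterized reward: weight agreement with near-optimal
    clinician actions against an outcome-based term (not the paper's exact form)."""
    imitation_term = 1.0 if action == clinician_action else 0.0
    return alpha * imitation_term + (1.0 - alpha) * outcome_signal

# Toy tabular Q-learning update driven by the blended reward.
n_states, n_actions = 100, 5
Q = np.zeros((n_states, n_actions))
gamma, lr = 0.99, 0.1

def q_update(state, action, reward, next_state):
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += lr * (td_target - Q[state, action])

q_update(state=3, action=1,
         reward=blended_reward(action=1, clinician_action=1, outcome_signal=0.8),
         next_state=4)
```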
Results: Extensive quantitative and qualitative experiments showed that SI significantly outperformed state-of-the-art baselines in both datasets. For sepsis, SI reduced estimated mortality rates by 19.6% compared to the best baseline. For diabetes, SI reduced HbA1c-High rates by 12.2%. The learned policies aligned closely with successful clinical decisions and deviated strategically when necessary. These deviations aligned with recent clinical findings, suggesting improved outcomes.
Discussion: Smart Imitator advances RL applications by addressing challenges such as imperfect data and environmental complexities, demonstrating effectiveness within the tested conditions of sepsis and diabetes. Further validation across diverse conditions and exploration of additional RL algorithms are needed to enhance precision and generalizability.
Conclusion: This study shows potential in advancing personalized healthcare learning from clinician behaviors to improve treatment outcomes. Its methodology offers a robust approach for adaptive, personalized strategies in various complex and uncertain environments.
{"title":"Smart Imitator: Learning from Imperfect Clinical Decisions.","authors":"Dilruk Perera, Siqi Liu, Kay Choong See, Mengling Feng","doi":"10.1093/jamia/ocae320","DOIUrl":"https://doi.org/10.1093/jamia/ocae320","url":null,"abstract":"<p><strong>Objectives: </strong>This study introduces Smart Imitator (SI), a 2-phase reinforcement learning (RL) solution enhancing personalized treatment policies in healthcare, addressing challenges from imperfect clinician data and complex environments.</p><p><strong>Materials and methods: </strong>Smart Imitator's first phase uses adversarial cooperative imitation learning with a novel sample selection schema to categorize clinician policies from optimal to nonoptimal. The second phase creates a parameterized reward function to guide the learning of superior treatment policies through RL. Smart Imitator's effectiveness was validated on 2 datasets: a sepsis dataset with 19 711 patient trajectories and a diabetes dataset with 7234 trajectories.</p><p><strong>Results: </strong>Extensive quantitative and qualitative experiments showed that SI significantly outperformed state-of-the-art baselines in both datasets. For sepsis, SI reduced estimated mortality rates by 19.6% compared to the best baseline. For diabetes, SI reduced HbA1c-High rates by 12.2%. The learned policies aligned closely with successful clinical decisions and deviated strategically when necessary. These deviations aligned with recent clinical findings, suggesting improved outcomes.</p><p><strong>Discussion: </strong>Smart Imitator advances RL applications by addressing challenges such as imperfect data and environmental complexities, demonstrating effectiveness within the tested conditions of sepsis and diabetes. Further validation across diverse conditions and exploration of additional RL algorithms are needed to enhance precision and generalizability.</p><p><strong>Conclusion: </strong>This study shows potential in advancing personalized healthcare learning from clinician behaviors to improve treatment outcomes. Its methodology offers a robust approach for adaptive, personalized strategies in various complex and uncertain environments.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142962554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shahram Yazdani, Ronald Claude Henry, Avery Byrne, Isaac Claude Henry
Objective: This study evaluates the utility of word embeddings, generated by large language models (LLMs), for medical diagnosis by comparing the semantic proximity of symptoms to their eponymic disease embedding ("eponymic condition") and the mean of all symptom embeddings associated with a disease ("ensemble mean").
Materials and methods: Symptom data for 5 diagnostically challenging pediatric diseases-CHARGE syndrome, Cowden disease, POEMS syndrome, Rheumatic fever, and Tuberous sclerosis-were collected from PubMed. Using the Ada-002 embedding model, disease names and symptoms were translated into vector representations in a high-dimensional space. Euclidean and Chebyshev distance metrics were used to classify symptoms based on their proximity to both the eponymic condition and the ensemble mean of the condition's symptoms.
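A minimal sketch of the distance-based classification: each symptom embedding is assigned to the closest disease representation, where the representation is either the eponymic disease-name embedding or the ensemble mean of that disease's symptom embeddings. The toy 8-dimensional random vectors below stand in for 1536-dimensional Ada-002 embeddings, and the data are illustrative only.

```python
import numpy as np

def nearest_disease(symptom_vec, disease_reps, metric="euclidean"):
    """Assign a symptom embedding to the closest disease representation.
    disease_reps maps disease name -> vector (eponymic embedding or ensemble mean)."""
    def dist(a, b):
        return np.abs(a - b).max() if metric == "chebyshev" else np.linalg.norm(a - b)
    return min(disease_reps, key=lambda name: dist(symptom_vec, disease_reps[name]))

# Toy vectors standing in for Ada-002 embeddings of each disease's symptoms.
rng = np.random.default_rng(0)
symptom_embeddings = {d: rng.normal(size=(6, 8))
                      for d in ["CHARGE syndrome", "Cowden disease", "POEMS syndrome"]}
ensemble_means = {d: m.mean(axis=0) for d, m in symptom_embeddings.items()}
predicted = nearest_disease(symptom_embeddings["CHARGE syndrome"][0], ensemble_means)
```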
Results: The ensemble mean approach showed significantly higher classification accuracy, correctly classifying between 80% (Cowden disease) and 100% (Tuberous sclerosis) of the sample disease symptoms using the Euclidean distance metric. In contrast, the eponymic condition approach showed poor and erratic symptom classification performance with both the Euclidean and Chebyshev distance metrics (0%-100% accuracy), with most results between 0% and 3% accuracy.
Discussion: The ensemble mean captures a disease's collective symptom profile, providing a more nuanced representation than the disease name alone. However, some misclassifications were due to superficial semantic similarities, highlighting the need for LLM models trained on medical corpora.
Conclusion: The ensemble mean of symptom embeddings improves classification accuracy over the eponymic condition approach. Future efforts should focus on medical-specific training of LLMs to enhance their diagnostic accuracy and clinical utility.
{"title":"Utility of word embeddings from large language models in medical diagnosis.","authors":"Shahram Yazdani, Ronald Claude Henry, Avery Byrne, Isaac Claude Henry","doi":"10.1093/jamia/ocae314","DOIUrl":"https://doi.org/10.1093/jamia/ocae314","url":null,"abstract":"<p><strong>Objective: </strong>This study evaluates the utility of word embeddings, generated by large language models (LLMs), for medical diagnosis by comparing the semantic proximity of symptoms to their eponymic disease embedding (\"eponymic condition\") and the mean of all symptom embeddings associated with a disease (\"ensemble mean\").</p><p><strong>Materials and methods: </strong>Symptom data for 5 diagnostically challenging pediatric diseases-CHARGE syndrome, Cowden disease, POEMS syndrome, Rheumatic fever, and Tuberous sclerosis-were collected from PubMed. Using the Ada-002 embedding model, disease names and symptoms were translated into vector representations in a high-dimensional space. Euclidean and Chebyshev distance metrics were used to classify symptoms based on their proximity to both the eponymic condition and the ensemble mean of the condition's symptoms.</p><p><strong>Results: </strong>The ensemble mean approach showed significantly higher classification accuracy, correctly classifying between 80% (Cowden disease) to 100% (Tuberous sclerosis) of the sample disease symptoms using the Euclidean distance metric. In contrast, the eponymic condition approach using Euclidian distance metric and Chebyshev distances, in general, showed poor symptom classification performance, with erratic results (0%-100% accuracy), largely ranging between 0% and 3% accuracy.</p><p><strong>Discussion: </strong>The ensemble mean captures a disease's collective symptom profile, providing a more nuanced representation than the disease name alone. However, some misclassifications were due to superficial semantic similarities, highlighting the need for LLM models trained on medical corpora.</p><p><strong>Conclusion: </strong>The ensemble mean of symptom embeddings improves classification accuracy over the eponymic condition approach. Future efforts should focus on medical-specific training of LLMs to enhance their diagnostic accuracy and clinical utility.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142958004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shalmali Joshi, Iñigo Urteaga, Wouter A C van Amsterdam, George Hripcsak, Pierre Elias, Benjamin Recht, Noémie Elhadad, James Fackler, Mark P Sendak, Jenna Wiens, Kaivalya Deshpande, Yoav Wald, Madalina Fiterau, Zachary Lipton, Daniel Malinsky, Madhur Nayan, Hongseok Namkoong, Soojin Park, Julia E Vogt, Rajesh Ranganath
The primary practice of healthcare artificial intelligence (AI) starts with model development, often using state-of-the-art AI, retrospectively evaluated using metrics lifted from the AI literature such as AUROC and DICE score. However, good performance on these metrics may not translate to improved clinical outcomes. Instead, we argue for a better development pipeline constructed by working backward from the end goal of positively impacting clinically relevant outcomes using AI, which leads to considerations of causality in model development and validation. Healthcare AI should be "actionable," and the change in actions induced by AI should improve outcomes. Quantifying the effect of changes in actions on outcomes is causal inference. The development, evaluation, and validation of healthcare AI should therefore account for the causal effect of intervening with the AI on clinically relevant outcomes. Using a causal lens, we make recommendations for key stakeholders at various stages of the healthcare AI pipeline. Our recommendations aim to increase the positive impact of AI on clinical outcomes.
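The argument that quantifying how changed actions change outcomes is causal inference can be illustrated with the simplest case, a randomized deployment of an AI-triggered action. The sketch below uses entirely simulated data to estimate the average effect of the intervention as a difference in outcome rates; observational settings would instead require confounding adjustment, and nothing here reflects the authors' specific recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
ai_arm = rng.integers(0, 2, size=n)             # random assignment to the AI-guided arm
event_prob = np.where(ai_arm == 1, 0.08, 0.10)  # assumed adverse-event rates per arm
adverse_event = rng.binomial(1, event_prob)

# Average effect of acting on the AI output, valid here because of randomization.
ate = adverse_event[ai_arm == 1].mean() - adverse_event[ai_arm == 0].mean()
print(f"Estimated change in adverse-event rate: {ate:+.3f}")
```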
{"title":"AI as an intervention: improving clinical outcomes relies on a causal approach to AI development and validation.","authors":"Shalmali Joshi, Iñigo Urteaga, Wouter A C van Amsterdam, George Hripcsak, Pierre Elias, Benjamin Recht, Noémie Elhadad, James Fackler, Mark P Sendak, Jenna Wiens, Kaivalya Deshpande, Yoav Wald, Madalina Fiterau, Zachary Lipton, Daniel Malinsky, Madhur Nayan, Hongseok Namkoong, Soojin Park, Julia E Vogt, Rajesh Ranganath","doi":"10.1093/jamia/ocae301","DOIUrl":"https://doi.org/10.1093/jamia/ocae301","url":null,"abstract":"<p><p>The primary practice of healthcare artificial intelligence (AI) starts with model development, often using state-of-the-art AI, retrospectively evaluated using metrics lifted from the AI literature like AUROC and DICE score. However, good performance on these metrics may not translate to improved clinical outcomes. Instead, we argue for a better development pipeline constructed by working backward from the end goal of positively impacting clinically relevant outcomes using AI, leading to considerations of causality in model development and validation, and subsequently a better development pipeline. Healthcare AI should be \"actionable,\" and the change in actions induced by AI should improve outcomes. Quantifying the effect of changes in actions on outcomes is causal inference. The development, evaluation, and validation of healthcare AI should therefore account for the causal effect of intervening with the AI on clinically relevant outcomes. Using a causal lens, we make recommendations for key stakeholders at various stages of the healthcare AI pipeline. Our recommendations aim to increase the positive impact of AI on clinical outcomes.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142957971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fabienne C Bourgeois, Amrita Sinha, Gaurav Tuli, Marvin B Harper, Virginia K Robbins, Sydney Jeffrey, John S Brownstein, Shahla M Jilani
Objective: Timely access to data is needed to improve care for substance-exposed birthing persons and their infants, a significant public health problem in the United States. We examined the current state of birthing person and infant/child (dyad) data-sharing capabilities supported by health information exchange (HIE) standards and HIE network capabilities for data exchange to inform point-of-care needs assessment for the substance-exposed dyad.
Material and methods: A cross-map analysis was performed using a set of dyadic data elements focused on pediatric development and longitudinal supportive care for substance-exposed dyads (70 birthing person and 110 infant/child elements). Cross-mapping was conducted to identify definitional alignment to standardized data fields within national healthcare data exchange standards, the United States Core Data for Interoperability (USCDI) version 4 (v4) and Fast Healthcare Interoperability Resources (FHIR) release 4 (R4), and applicable structured vocabulary standards or terminology associated with USCDI. Subsequent survey analysis examined representative HIE network sharing capabilities, focusing on USCDI and FHIR usage.
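At its core, the cross-map analysis records, for each dyadic element, which standardized fields (if any) it aligns with and then computes match rates, as in the minimal sketch below. The element and field names are invented for illustration and are not the study's actual mapping.

```python
# Hypothetical cross-map: dyadic data element -> USCDI v4 fields it aligns with
# (empty list = no alignment). Names are illustrative only.
cross_map = {
    "birthing person substance use history": ["Substance Use Status"],
    "infant withdrawal assessment score": [],
    "infant feeding method": ["Clinical Note"],
}

mapped = sum(bool(fields) for fields in cross_map.values())
match_rate = 100 * mapped / len(cross_map)
print(f"{match_rate:.2f}% of dyadic elements aligned to >=1 USCDI v4 field")
```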
Results: In total, 91.11% of dyadic data elements cross-mapped to at least 1 USCDI v4 standardized data field (87.80% of those structured), and 88.89% cross-mapped to FHIR R4. Seventy-five percent of the surveyed HIE networks reported supporting USCDI version 1 or 2 and the capability to use FHIR, though demand is limited.
Discussion: HIE of clinical and supportive care data for substance-exposed dyads is supported by current national standards, though limitations exist.
Conclusion: These findings offer a dyadic-focused framework for electronic health record-centered data exchange to inform bedside care longitudinally across clinical touchpoints and population-level health.
{"title":"The substance-exposed birthing person-infant/child dyad and health information exchange in the United States.","authors":"Fabienne C Bourgeois, Amrita Sinha, Gaurav Tuli, Marvin B Harper, Virginia K Robbins, Sydney Jeffrey, John S Brownstein, Shahla M Jilani","doi":"10.1093/jamia/ocae315","DOIUrl":"https://doi.org/10.1093/jamia/ocae315","url":null,"abstract":"<p><strong>Objective: </strong>Timely access to data is needed to improve care for substance-exposed birthing persons and their infants, a significant public health problem in the United States. We examined the current state of birthing person and infant/child (dyad) data-sharing capabilities supported by health information exchange (HIE) standards and HIE network capabilities for data exchange to inform point-of-care needs assessment for the substance-exposed dyad.</p><p><strong>Material and methods: </strong>A cross-map analysis was performed using a set of dyadic data elements focused on pediatric development and longitudinal supportive care for substance-exposed dyads (70 birthing person and 110 infant/child elements). Cross-mapping was conducted to identify definitional alignment to standardized data fields within national healthcare data exchange standards, the United States Core Data for Interoperability (USCDI) version 4 (v4) and Fast Healthcare Interoperability Resources (FHIR) release 4 (R4), and applicable structured vocabulary standards or terminology associated with USCDI. Subsequent survey analysis examined representative HIE network sharing capabilities, focusing on USCDI and FHIR usage.</p><p><strong>Results: </strong>91.11% of dyadic data elements cross-mapped to at least 1 USCDI v4 standardized data field (87.80% of those structured) and 88.89% to FHIR R4. 75% of the surveyed HIE networks reported supporting USCDI versions 1 or 2 and the capability to use FHIR, though demand is limited.</p><p><strong>Discussion: </strong>HIE of clinical and supportive care data for substance-exposed dyads is supported by current national standards, though limitations exist.</p><p><strong>Conclusion: </strong>These findings offer a dyadic-focused framework for electronic health record-centered data exchange to inform bedside care longitudinally across clinical touchpoints and population-level health.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142923823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aubrey Limburg, Nicole Gladish, David H Rehkopf, Robert L Phillips, Victoria Udalova
Objectives: To evaluate the likelihood of linking electronic health records (EHRs) to restricted individual-level American Community Survey (ACS) data based on patient health condition.
Materials and methods: Electronic health records (2019-2021) were derived from a primary care registry collected by the American Board of Family Medicine. These data were assigned anonymized person-level identifiers (Protected Identification Keys [PIKs]) at the U.S. Census Bureau. These records were then linked to restricted individual-level data from the ACS (2005-2022). We used logistic regressions to evaluate match rates for patients with health conditions across a range of severity: hypertension, diabetes, and chronic kidney disease.
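The match-rate comparison reduces to a logistic regression of a binary matched/not-matched indicator on condition indicators, with coefficients exponentiated into odds ratios. The sketch below uses simulated data and statsmodels; the variable names and the minimal specification are assumptions, and the study's adjusted models would include additional covariates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated patient-level data: matched = 1 if the EHR record linked to ACS.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "matched": rng.integers(0, 2, size=1000),
    "hypertension": rng.integers(0, 2, size=1000),
    "diabetes": rng.integers(0, 2, size=1000),
})

X = sm.add_constant(df[["hypertension", "diabetes"]])
fit = sm.Logit(df["matched"], X).fit(disp=0)

odds_ratios = np.exp(fit.params)   # odds of matching per condition
conf_int = np.exp(fit.conf_int())  # 95% CIs on the odds-ratio scale
```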
Results: Among more than 2.8 million patients, 99.2% were assigned person-level identifiers (PIKs). There were some differences in the odds of receiving an identifier in adjusted models for patients with hypertension (OR = 1.70, 95% CI: 1.63, 1.77) and diabetes (OR = 1.17, 95% CI: 1.13, 1.22), relative to those without. There were only small differences in the odds of matching to ACS in adjusted models for patients with hypertension (OR = 1.03, 95% CI: 1.03, 1.04), diabetes (OR = 1.02, 95% CI: 1.01, 1.03), and chronic kidney disease (OR = 1.05, 95% CI: 1.03, 1.06), relative to those without.
Discussion and conclusion: Our work supports evidence-building across government consistent with the Foundations for Evidence-Based Policymaking Act of 2018 and the goal of leveraging data as a strategic asset. Given the high PIK and ACS match rates, with small differences based on health condition, our findings suggest the feasibility of enhancing the utility of EHR data for research focused on health.
{"title":"Linking national primary care electronic health records to individual records from the U.S. Census Bureau's American Community Survey: evaluating the likelihood of linkage based on patient health.","authors":"Aubrey Limburg, Nicole Gladish, David H Rehkopf, Robert L Phillips, Victoria Udalova","doi":"10.1093/jamia/ocae269","DOIUrl":"10.1093/jamia/ocae269","url":null,"abstract":"<p><strong>Objectives: </strong>To evaluate the likelihood of linking electronic health records (EHRs) to restricted individual-level American Community Survey (ACS) data based on patient health condition.</p><p><strong>Materials and methods: </strong>Electronic health records (2019-2021) are derived from a primary care registry collected by the American Board of Family Medicine. These data were assigned anonymized person-level identifiers (Protected Identification Keys [PIKs]) at the U.S. Census Bureau. These records were then linked to restricted individual-level data from the ACS (2005-2022). We used logistic regressions to evaluate match rates for patients with health conditions across a range of severity: hypertension, diabetes, and chronic kidney disease.</p><p><strong>Results: </strong>Among more than 2.8 million patients, 99.2% were assigned person-level identifiers (PIKs). There were some differences in the odds of receiving an identifier in adjusted models for patients with hypertension (OR = 1.70, 95% CI: 1.63, 1.77) and diabetes (OR = 1.17, 95% CI: 1.13, 1.22), relative to those without. There were only small differences in the odds of matching to ACS in adjusted models for patients with hypertension (OR = 1.03, 95% CI: 1.03, 1.04), diabetes (OR = 1.02, 95% CI: 1.01, 1.03), and chronic kidney disease (OR = 1.05, 95% CI: 1.03, 1.06), relative to those without.</p><p><strong>Discussion and conclusion: </strong>Our work supports evidence-building across government consistent with the Foundations for Evidence-Based Policymaking Act of 2018 and the goal of leveraging data as a strategic asset. Given the high PIK and ACS match rates, with small differences based on health condition, our findings suggest the feasibility of enhancing the utility of EHR data for research focused on health.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"97-104"},"PeriodicalIF":4.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11648727/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142607321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joshua Trujeque, R Adams Dudley, Nathan Mesfin, Nicholas E Ingraham, Isai Ortiz, Ann Bangerter, Anjan Chakraborty, Dalton Schutte, Jeremy Yeung, Ying Liu, Alicia Woodward-Abel, Emma Bromley, Rui Zhang, Lisa A Brenner, Joseph A Simonetti
Objective: Access to firearms is associated with increased suicide risk. Our aim was to develop a natural language processing approach to characterizing firearm access in clinical records.
Materials and methods: We used clinical notes from 36 685 Veterans Health Administration (VHA) patients between April 10, 2023 and April 10, 2024. We expanded preexisting firearm term sets using subject matter experts and generated 250-character snippets around each firearm term appearing in notes. Annotators labeled 3000 snippets into three classes. Using these annotated snippets, we compared four nonneural machine learning models (random forest, bagging, gradient boosting, logistic regression with ridge penalization) and two versions of Bidirectional Encoder Representations from Transformers, or BERT (specifically, BioBERT and Bio-ClinicalBERT) for classifying firearm access as "definite access", "definitely no access", or "other".
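One of the four nonneural baselines, logistic regression with ridge (L2) penalization, can be sketched as a plain text-classification pipeline over the 250-character snippets. The snippets, labels, and TF-IDF featurization below are illustrative assumptions; the abstract does not describe the study's actual feature engineering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy snippets; real snippets are 250-character windows around firearm terms.
snippets = ["keeps a handgun locked in a safe at home",
            "denies access to any firearms",
            "reviewed firearm safety brochure with family"]
labels = ["definite access", "definitely no access", "other"]

# L2 ("ridge") penalization is LogisticRegression's default penalty.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(penalty="l2", max_iter=1000))
clf.fit(snippets, labels)
print(clf.predict(["rifle stored unloaded in the garage"]))
```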
Results: Firearm terms were identified in 36 685 patient records (41.3%); 33.7% of snippets were categorized as definite access, 9.0% as definitely no access, and 57.2% as "other". Among models classifying firearm access, five of six had acceptable performance, with BioBERT and Bio-ClinicalBERT performing best, with F1 scores of 0.876 (95% confidence interval, 0.874-0.879) and 0.896 (95% confidence interval, 0.894-0.899), respectively.
Discussion and conclusion: Firearm-related terminology is common in the clinical records of VHA patients. The ability to use text to identify and characterize patients' firearm access could enhance suicide prevention efforts, and five of our six models could be used to identify patients for clinical interventions.
{"title":"Comparison of six natural language processing approaches to assessing firearm access in Veterans Health Administration electronic health records.","authors":"Joshua Trujeque, R Adams Dudley, Nathan Mesfin, Nicholas E Ingraham, Isai Ortiz, Ann Bangerter, Anjan Chakraborty, Dalton Schutte, Jeremy Yeung, Ying Liu, Alicia Woodward-Abel, Emma Bromley, Rui Zhang, Lisa A Brenner, Joseph A Simonetti","doi":"10.1093/jamia/ocae169","DOIUrl":"10.1093/jamia/ocae169","url":null,"abstract":"<p><strong>Objective: </strong>Access to firearms is associated with increased suicide risk. Our aim was to develop a natural language processing approach to characterizing firearm access in clinical records.</p><p><strong>Materials and methods: </strong>We used clinical notes from 36 685 Veterans Health Administration (VHA) patients between April 10, 2023 and April 10, 2024. We expanded preexisting firearm term sets using subject matter experts and generated 250-character snippets around each firearm term appearing in notes. Annotators labeled 3000 snippets into three classes. Using these annotated snippets, we compared four nonneural machine learning models (random forest, bagging, gradient boosting, logistic regression with ridge penalization) and two versions of Bidirectional Encoder Representations from Transformers, or BERT (specifically, BioBERT and Bio-ClinicalBERT) for classifying firearm access as \"definite access\", \"definitely no access\", or \"other\".</p><p><strong>Results: </strong>Firearm terms were identified in 36 685 patient records (41.3%), 33.7% of snippets were categorized as definite access, 9.0% as definitely no access, and 57.2% as \"other\". Among models classifying firearm access, five of six had acceptable performance, with BioBERT and Bio-ClinicalBERT performing best, with F1s of 0.876 (95% confidence interval, 0.874-0.879) and 0.896 (95% confidence interval, 0.894-0.899), respectively.</p><p><strong>Discussion and conclusion: </strong>Firearm-related terminology is common in the clinical records of VHA patients. The ability to use text to identify and characterize patients' firearm access could enhance suicide prevention efforts, and five of our six models could be used to identify patients for clinical interventions.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"113-118"},"PeriodicalIF":4.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11648724/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142631461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Betina Idnay, Gongbo Zhang, Fangyi Chen, Casey N Ta, Matthew W Schelke, Karen Marder, Chunhua Weng
Objective: This study aims to automate the prediction of Mini-Mental State Examination (MMSE) scores, a widely adopted standard for cognitive assessment in patients with Alzheimer's disease, using natural language processing (NLP) and machine learning (ML) on structured and unstructured EHR data.
Materials and methods: We extracted demographic data, diagnoses, medications, and unstructured clinical visit notes from the EHRs. We used Latent Dirichlet Allocation (LDA) for topic modeling and Term Frequency-Inverse Document Frequency (TF-IDF) for n-grams. In addition, we extracted meta-features such as age, ethnicity, and race. Model training and evaluation employed eXtreme Gradient Boosting (XGBoost), Stochastic Gradient Descent Regressor (SGDRegressor), and Multi-Layer Perceptron (MLP).
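The regression setup pairs text-derived features with a continuous MMSE target; a minimal sketch of the n-gram + MLP variant, evaluated with RMSE, is below. The toy notes and scores are invented, and the vectorizer settings and hyperparameters are assumptions rather than those tuned in the study.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Toy visit notes paired with MMSE scores; real notes are far longer.
notes = ["oriented to person only, repeats three words with difficulty",
         "scored well on recall and orientation items today",
         "significant word-finding difficulty, disoriented to time and place"]
mmse_scores = [14, 27, 11]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000,
                                   random_state=0))
model.fit(notes, mmse_scores)
rmse = np.sqrt(mean_squared_error(mmse_scores, model.predict(notes)))
```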
Results: We analyzed 1654 clinical visit notes collected between September 2019 and June 2023 for 1000 Alzheimer's disease patients. The average MMSE score was 20; patients averaged 76.4 years old, 54.7% were female, and 54.7% identified as White. The best-performing model (ie, the one with the lowest root mean squared error [RMSE]) was the MLP, which achieved an RMSE of 5.53 on the validation set using n-grams, indicating superior prediction performance over the other models and feature sets. The RMSE on the test set was 5.85.
Discussion: This study developed a ML method to predict MMSE scores from unstructured clinical notes, demonstrating the feasibility of utilizing NLP to support cognitive assessment. Future work should focus on refining the model and evaluating its clinical relevance across diverse settings.
Conclusion: We contributed a model for automating MMSE estimation using EHR features, potentially transforming cognitive assessment for Alzheimer's patients and paving the way for more informed clinical decisions and cohort identification.
{"title":"Mini-mental status examination phenotyping for Alzheimer's disease patients using both structured and narrative electronic health record features.","authors":"Betina Idnay, Gongbo Zhang, Fangyi Chen, Casey N Ta, Matthew W Schelke, Karen Marder, Chunhua Weng","doi":"10.1093/jamia/ocae274","DOIUrl":"10.1093/jamia/ocae274","url":null,"abstract":"<p><strong>Objective: </strong>This study aims to automate the prediction of Mini-Mental State Examination (MMSE) scores, a widely adopted standard for cognitive assessment in patients with Alzheimer's disease, using natural language processing (NLP) and machine learning (ML) on structured and unstructured EHR data.</p><p><strong>Materials and methods: </strong>We extracted demographic data, diagnoses, medications, and unstructured clinical visit notes from the EHRs. We used Latent Dirichlet Allocation (LDA) for topic modeling and Term-Frequency Inverse Document Frequency (TF-IDF) for n-grams. In addition, we extracted meta-features such as age, ethnicity, and race. Model training and evaluation employed eXtreme Gradient Boosting (XGBoost), Stochastic Gradient Descent Regressor (SGDRegressor), and Multi-Layer Perceptron (MLP).</p><p><strong>Results: </strong>We analyzed 1654 clinical visit notes collected between September 2019 and June 2023 for 1000 Alzheimer's disease patients. The average MMSE score was 20, with patients averaging 76.4 years old, 54.7% female, and 54.7% identifying as White. The best-performing model (ie, lowest root mean squared error (RMSE)) is MLP, which achieved an RMSE of 5.53 on the validation set using n-grams, indicating superior prediction performance over other models and feature sets. The RMSE on the test set was 5.85.</p><p><strong>Discussion: </strong>This study developed a ML method to predict MMSE scores from unstructured clinical notes, demonstrating the feasibility of utilizing NLP to support cognitive assessment. Future work should focus on refining the model and evaluating its clinical relevance across diverse settings.</p><p><strong>Conclusion: </strong>We contributed a model for automating MMSE estimation using EHR features, potentially transforming cognitive assessment for Alzheimer's patients and paving the way for more informed clinical decisions and cohort identification.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"119-128"},"PeriodicalIF":4.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11648712/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142631470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}