
JAMIA Open: latest articles

Evaluating the impact of data biases on algorithmic fairness and clinical utility of machine learning models for prolonged opioid use prediction.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-09-30 eCollection Date: 2025-10-01 DOI: 10.1093/jamiaopen/ooaf115
Behzad Naderalvojoud, Catherine Curtin, Steven M Asch, Keith Humphreys, Tina Hernandez-Boussard

Objectives: The growing use of machine learning (ML) in healthcare raises concerns about how data biases affect real-world model performance. While existing frameworks evaluate algorithmic fairness, they often overlook the impact of bias on generalizability and clinical utility, which are critical for safe deployment. Building on prior methods, this study extends bias analysis to include clinical utility, addressing a key gap between fairness evaluation and decision-making.

Materials and methods: We applied a 3-phase evaluation to a previously developed model predicting prolonged opioid use (POU), validated on Veterans Health Administration (VHA) data. The analysis included internal and external validation, model retraining on VHA data, and subgroup evaluation across demographic, vulnerable, risk, and comorbidity groups. We assessed performance using area under the receiver operating characteristic curve (AUROC), calibration, and decision curve analysis, incorporating standardized net-benefits to evaluate clinical utility alongside fairness and generalizability.
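
The net-benefit quantity behind decision curve analysis can be sketched as follows. This is a minimal illustration using the standard definitions (net benefit = TP/N - FP/N x pt/(1 - pt); standardized net benefit = net benefit / prevalence); the abstract does not give the authors' exact computation, and the function names and toy data below are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on everyone whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * (threshold / (1 - threshold))


def standardized_net_benefit(y_true, y_prob, threshold):
    """Net benefit divided by outcome prevalence, so a perfect model scores 1."""
    prevalence = sum(y_true) / len(y_true)
    return net_benefit(y_true, y_prob, threshold) / prevalence


# Toy labels and predicted risks (hypothetical, not study data).
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.3, 0.8, 0.05, 0.35]
for pt in (0.1, 0.3, 0.5):
    print(pt, round(standardized_net_benefit(y_true, y_prob, pt), 3))
```

Dividing by prevalence is what makes curves comparable between cohorts whose outcome rates differ, as the internal and external cohorts do here.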

Results: The internal cohort (N = 41 929) had a 14.7% POU prevalence, compared to 34.3% in the external VHA cohort (N = 397 150). The model's AUROC decreased from 0.74 in the internal test cohort to 0.70 in the full external cohort. Subgroup-level performance averaged 0.69 (SD = 0.01), showing minimal deviation from the external cohort overall. Retraining on VHA data improved AUROCs to 0.82. Clinical utility analysis showed systematic shifts in net-benefit across threshold probabilities.

Discussion: While the POU model showed generalizability and fairness internally, external validation and retraining revealed performance and utility shifts across subgroups.

Conclusion: Population-specific biases affect clinical utility, an often-overlooked dimension in fairness evaluation and a key consideration for ensuring equitable benefits across diverse patient groups.

JAMIA Open 2025;8(5):ooaf115. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12483547/pdf/
Citations: 0
Developing a real-time registry to track breast cancer patients across the city of Boston.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-09-29 eCollection Date: 2025-10-01 DOI: 10.1093/jamiaopen/ooaf099
Amy M LeClair, Clara A Chen, Marisa L Mizzoni, William G Adams, William F Harvey, Christopher W Shanahan, Jennifer S Haas, Stephenie C Lemon, Tracy Battaglia, Karen M Freund

Objectives: Patient navigation is designed to identify and address patients' needs throughout their cancer treatment. In the context of a clinical trial designed to deliver a standardized patient navigation protocol, a registry was needed to allow users from across multiple health systems to input patient data, track navigation outreach, and coordinate cancer care in real time. The objective was to design a registry that allows patient navigators (PNs) at 6 medical centers across 4 health systems to track breast cancer patients determined to be most at risk for delays in treatment.

Materials and methods: A multi-disciplinary team chose REDCap to host the registry. The aim was to develop a platform that would (1) manage a caseload of patients who are most vulnerable to delays; (2) track patients through the continuum of cancer care in real time; (3) allow PNs to prioritize certain patients; (4) facilitate inter-system communication; and (5) allow the research team to monitor navigators' activities (in the context of a research study, for supervision and feedback).

Results: The registry was built through collaboration with clinical providers, PNs, informatics specialists, and expert developers from the REDCap team, using the software's standard features and incorporating additional functionality through SAS programming.

Conclusion: REDCap provided an accessible and modifiable platform for hosting a registry to track patients in real time. However, it did not streamline PNs' workflows or reduce data entry burdens as intended. A major barrier was the lack of interoperability with the pre-existing systems that navigators use, which led to redundancy and increased the burden of documentation.

JAMIA Open 2025;8(5):ooaf099. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12478474/pdf/
Citations: 0
Primary care physicians' experiences with inbox triage.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-09-26 eCollection Date: 2025-10-01 DOI: 10.1093/jamiaopen/ooaf105
Adam Rule, Rutvi Shah, Christina Dudley, Mark A Micek, Brian G Arndt

Objective: Many primary care physicians (PCPs) feel overwhelmed by the number of electronic health record inbox messages they receive. The objective of this study was to characterize PCPs' experiences with inbox triage, the process of reviewing inbox messages and deciding when and how to address them.

Materials and methods: We conducted 3 focus groups and 1 individual interview with 9 PCPs at an academic medical center and coded the transcripts for themes related to inbox triage.

Results: We identified 5 themes in PCPs' experiences with inbox triage: (1) inbox triage is a continuous process; (2) inbox triage involves different team members performing multiple activities, including identifying messages better addressed through synchronous care, preparing messages to be reviewed by PCPs, and prioritizing messages; (3) PCPs prioritize messages based on multiple factors including clinical urgency, time constraints, and team member involvement; (4) team support for inbox triage varies by clinical experience, team stability, and co-location; and (5) patient expectations and clinic practices help make inbox triage a continuous process, requiring PCPs to establish personal policies to constrain inbox work.

Discussion: Designers of clinic workflows, healthcare policy, and health information technology should aim to support the diverse activities involved in inbox triage, message prioritization based on multiple factors, and the collaborative process of establishing and communicating messaging norms.

Conclusion: Inbox triage is a collaborative and continuous process requiring PCPs to evaluate multiple aspects of each message, find time to address those messages during busy clinic days, and negotiate different expectations for messaging behavior.

JAMIA Open 2025;8(5):ooaf105. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12470652/pdf/
Citations: 0
medspacyV: a graphical user interface for the open source medspaCy natural language processing package.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-08-23 eCollection Date: 2025-08-01 DOI: 10.1093/jamiaopen/ooaf094
Bharath Velamala, Elham Sagheb Hossein Pour, Michael Lin, Jungwei Wilfred Fan

Objectives: To enable users with modest technical background to perform biomedical natural language processing (NLP).

Materials and methods: We developed medspacyV using tkinter, Python's standard graphical user interface library, following the model-view-controller (MVC) design pattern. The interface wraps around a rule-based pipeline for sentence splitting, section segmentation, concept identification, and negation detection.
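
To make the pipeline stages concrete, here is a toy, standard-library-only sketch of sentence splitting, dictionary-based concept identification, and trigger-based negation detection. It is not medspaCy's API or the medspacyV implementation; the rule tables and function names are hypothetical, the matching is deliberately naive, and section segmentation is left out for brevity.

```python
import re

# Hypothetical rule tables; a real project would load these from configuration.
CONCEPTS = {"pneumonia": "PROBLEM", "fever": "PROBLEM"}
NEGATION_TRIGGERS = ("no ", "denies ", "without ")


def split_sentences(text):
    """Naive splitter: break on whitespace that follows ., ?, or !."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def annotate(text):
    """Return (concept, label, is_negated, sentence) for each rule match."""
    rows = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for term, label in CONCEPTS.items():
            pos = lowered.find(term)
            if pos == -1:
                continue
            # Negated if any trigger occurs earlier in the same sentence.
            preceding = lowered[:pos]
            negated = any(trigger in preceding for trigger in NEGATION_TRIGGERS)
            rows.append((term, label, negated, sentence))
    return rows


for row in annotate("Patient denies fever. Chest x-ray shows pneumonia."):
    print(row)
```

Each output row can be traced back to one dictionary entry and one trigger, which is the kind of traceability a rule-based design offers.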

Results: The primary window allows the user to configure the project and NLP rules, execute the pipeline, and save the outputs into a table. A separate annotation viewer window can be launched to inspect the immediate or previous NLP outputs.

Discussion: We developed medspacyV with three rationales: controllability, explainability, and economy. The rule-based approach is sufficient for many NLP use cases.

Conclusion: The medspacyV program is publicly available at https://github.com/medspacy/medspacyV, targeting use by healthcare professionals and researchers in their NLP projects.

JAMIA Open 2025;8(4):ooaf094. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12374723/pdf/
Citations: 0
Defining dyadic cancer pain concordance using participant-initiated interactions with a remote health monitoring system.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-08-22 eCollection Date: 2025-08-01 DOI: 10.1093/jamiaopen/ooaf088
Mina Ostovari, Natalie Crimp, Sarah J Ratcliffe, Virginia LeBaron

Background: Studies on symptom concordance between patients and their caregivers often use cross-sectional designs, which may fail to capture the longitudinal, dynamic symptom experience. The Behavioral and Environmental Sensing and Intervention for Cancer (BESI-C) is a remote health monitoring system that utilizes smartwatches and ecological momentary assessments (EMAs) to empower patients and caregivers to monitor and manage cancer pain at home. BESI-C collects real-time symptom data in naturalistic settings, enabling longitudinal tracking and analysis of symptom patterns over time.

Objective: To define and examine dyadic concordance using participant-initiated symptom reports collected via remote health monitoring.

Methods: Dyads of patients with advanced cancer and their family caregivers were recruited to use BESI-C for 2 weeks, reporting pain in real time through EMAs. We used Bangdiwala's B statistic to determine the concordance of patient-reported pain and caregiver-reported perceived patient pain under different contextual criteria (eg, co-location of participants; user engagement with BESI-C) that we hypothesized would impact concordance. We also explored the hypothesis that concordance would improve from study week 1 to week 2.
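
For readers unfamiliar with Bangdiwala's B, the statistic for a square agreement table is the sum of squared diagonal counts divided by the sum of products of matching row and column totals. The sketch below assumes pain reports have already been paired into a 2 x 2 table; the abstract does not describe the pairing rule, and the counts shown are hypothetical.

```python
def bangdiwala_b(table):
    """Bangdiwala's B for a square agreement table (rows: rater 1, columns: rater 2)."""
    k = len(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Area of the squares of exact agreement.
    agreement_area = sum(table[i][i] ** 2 for i in range(k))
    # Area of the rectangles formed by the marginal totals.
    marginal_area = sum(row_totals[i] * col_totals[i] for i in range(k))
    return agreement_area / marginal_area


# Hypothetical counts of time windows:
# rows = patient reported pain (yes, no); columns = caregiver reported patient pain (yes, no).
table = [[20, 10],
         [5, 65]]
print(round(bangdiwala_b(table), 3))
```

B ranges from 0 to 1, with 1 indicating that every observation falls on the diagonal.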

Results: Data from 21 patient-caregiver dyads were used for analysis. The reporting of pain events was highly variable between patients and their caregivers. Concordance of pain reporting improved when patients and caregivers were co-located and both wearing their BESI-C smartwatches. We did not observe consistent patterns in patient-caregiver concordance between week 1 and week 2.

Conclusion: We propose an analytical approach to define and evaluate concordance between patients' and caregivers' real-time symptom reports that can be applied to dyadic, longitudinal symptom data collected using remote health monitoring. Future work should examine how patient-caregiver symptom concordance relates to key quality-of-life metrics and to the sociodemographic factors that impact participant engagement with remote health monitoring technologies.

JAMIA Open 2025;8(4):ooaf088. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12373113/pdf/
Citations: 0
Assessing the advantages and disadvantages of dimensionality reduction methods in summarizing housing determinants of health in the United States.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-08-18 eCollection Date: 2025-08-01 DOI: 10.1093/jamiaopen/ooaf093
Xingyu Chen, Christopher Kitchen, Hadi Kharrazi

Objectives: To evaluate and compare different dimensionality reduction techniques for quantifying housing conditions as a social determinant of health (SDOH) across various geographic levels in the United States.

Materials and methods: A total of 15 housing characteristics from the American Community Survey data were analyzed at county, ZIP code, and Census tract levels. The robustness of 3 dimensionality reduction techniques was assessed in reducing the 15 housing characteristics into 1 housing score. These summarization methods included principal component analysis (PCA), t-distributed stochastic neighbor embedding (tSNE), and uniform manifold approximation and projection (UMAP). We visualized geographic distributions of the housing scores, assessed methodological discrepancies between the techniques, and analyzed agreement between housing characteristic variability and housing score variability.
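
As an illustration of collapsing several characteristics into one score with PCA, the sketch below standardizes each column and projects every area onto the first principal component, found by power iteration. It is a pure-Python toy under stated assumptions (no constant columns, positively correlated features so that the all-ones starting vector converges); the data are hypothetical, and the study's actual preprocessing and software are not given in the abstract. tSNE and UMAP are not shown.

```python
import math


def first_pc_scores(rows, iterations=200):
    """Project standardized rows onto the first principal component."""
    n, d = len(rows), len(rows[0])
    # Standardize each column to zero mean and unit sample variance.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    # Covariance of standardized data, i.e. the correlation matrix.
    cov = [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeated multiplication converges to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(zr[j] * v[j] for j in range(d)) for zr in z]


# Hypothetical data: 5 areas x 3 housing characteristics.
rows = [[10, 5, 2], [20, 9, 4], [30, 16, 6], [40, 20, 9], [50, 26, 10]]
scores = first_pc_scores(rows)
print([round(s, 2) for s in scores])
```

Because the projection is linear, each area's score is a weighted sum of its standardized characteristics, and the weights can be inspected directly.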

Results: The selected dimensionality reduction methods generated housing scores that demonstrated acceptable face validity when visualized through choropleth maps. The PCA method provided the most stable and consistent results across geographic levels. The PCA method also resulted in the highest correlation between the variability of the underlying housing characteristics and the summarized housing score.

Discussion: Data-driven summarization techniques provide an alternative approach to traditional expert-based indices in capturing housing conditions as a single SDOH factor. In this study, among the different summarized housing scores, the PCA-generated score offered superior robustness, persistent data structure, and higher stability across years.

Conclusion: Principal component analysis was identified as the most reliable and interpretable approach for summarizing housing conditions across geographic levels. These findings contribute to the methodological foundation required to develop robust SDOH measures that can inform public health policies and address health disparities.

JAMIA Open 2025;8(4):ooaf093. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12360777/pdf/
Citations: 0
Benchmarking of pre-training strategies for electronic health record foundation models.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-08-13 eCollection Date: 2025-08-01 DOI: 10.1093/jamiaopen/ooaf090
Samson Mataraso, Shreya D'Souza, David Seong, Eloïse Berson, Camilo Espinosa, Nima Aghaeepour

Objective: Our objective is to compare different pre-training strategies for electronic health record (EHR) foundation models.

Materials and methods: We evaluated three approaches using a transformer-based architecture: baseline (no pre-training), self-supervised pre-training with masked language modeling, and supervised pre-training. The models were assessed on their ability to predict both major adverse cardiac events and mortality occurring within 12 months. The pre-training cohort comprised 405 679 patients prescribed antihypertensives, and the fine-tuning cohort comprised 5525 patients who received doxorubicin.

Results: Task-specific supervised pre-training achieved superior performance (AUROC 0.70, AUPRC 0.23), outperforming both self-supervised pre-training and the baseline. However, when the model was evaluated on the task of 12-month mortality prediction, the self-supervised model performed best.

Discussion: While supervised pre-training excels when aligned with downstream tasks, self-supervised approaches offer more generalized utility.

Conclusion: Pre-training strategy selection should consider intended applications, data availability, and transferability requirements.

Citations: 0
EchoLLM: extracting echocardiogram entities with light-weight, open-source large language models.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-08-13 eCollection Date: 2025-08-01 DOI: 10.1093/jamiaopen/ooaf092
Jonathan Chi, Yazan Rouphail, Ethan Hillis, Ningning Ma, An Nguyen, Jane Wang, Mackenzie Hofford, Aditi Gupta, Patrick G Lyons, Adam Wilcox, Albert M Lai, Philip R O Payne, Marin H Kollef, Caitlin Dreisbach, Andrew P Michelson

Objectives: Large language models (LLMs) have demonstrated high levels of performance in clinical information extraction compared to rule-based systems and traditional machine-learning approaches, offering scalability, contextualization, and easier deployment. However, most studies rely on proprietary models with privacy concerns and high costs, limiting accessibility. We aim to evaluate 14 publicly available open-source LLMs for extracting clinically relevant findings from free-text echocardiogram reports and examine the feasibility of their implementation in information extraction workflows.

Materials and methods: We used 14 open-source LLMs to extract clinically relevant entities from echocardiogram reports (n = 507). Each report was manually annotated by 2 independent health-care professionals and adjudicated by a third. Lexical variance and length of each echocardiogram report were collected. Precision, recall, and F1 scores were calculated for the 9 extracted entities via multiclass classification.

Results: In aggregate, Gemma2:9b-instruct had the highest precision, recall, and F1 scores at 0.973 (0.962-0.983), 0.959 (0.947-0.973), and 0.965 (0.951-0.975), respectively. In comparison, Phi3:3.8b-mini-instruct had the lowest precision score at 0.831 (0.804-0.856), while Gemma:7b-instruct had the lowest recall and F1 scores at 0.382 (0.356-0.408) and 0.392 (0.356-0.428), respectively.

Discussion and conclusion: Using LLMs for entity extraction for echocardiogram reports has the potential to support both clinical research and health-care delivery. Our work demonstrates the feasibility of using open-source models for more efficient computation and extraction.
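The precision, recall, and F1 metrics named in this abstract can be illustrated with a minimal sketch. It assumes single-label multiclass predictions for one extracted entity; the function name `precision_recall_f1` and the example labels are hypothetical and do not come from the paper, and the confidence intervals reported in the abstract are not reproduced here.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred):
    """Per-label (precision, recall, F1) for single-label multiclass predictions.

    Illustrative only: the label values and this helper are assumptions,
    not the authors' evaluation code.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this label
        else:
            fp[pred] += 1       # predicted label was wrong
            fn[truth] += 1      # true label was missed

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical annotations vs. model output for one entity.
annotated = ["normal", "reduced", "normal", "severe"]
extracted = ["normal", "normal", "normal", "severe"]
scores = precision_recall_f1(annotated, extracted)
```

In this toy example the "reduced" label is never predicted, so its recall and F1 are zero, while "normal" keeps perfect recall but loses precision because of the one false positive.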

Citations: 0
Evaluating prompt and data perturbation sensitivity in large language models for radiology reports classification.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-08-12 eCollection Date: 2025-08-01 DOI: 10.1093/jamiaopen/ooaf073
Vera Sorin, Jeremy D Collins, Alex K Bratt, Joanna E Kusmirek, Vamshi K Mugu, Timothy L Kline, Crystal L Butler, Nadia G Wood, Cole J Cook, Panagiotis Korfiatis

Objectives: Large language models (LLMs) offer potential in natural language processing tasks in healthcare. Due to the need for high accuracy, understanding their limitations is essential. The purpose of this study was to evaluate the performance of LLMs in classifying radiology reports for the presence of pulmonary embolism (PE) under various conditions, including different prompt designs and data perturbations.

Materials and methods: In this retrospective, institutional review board-approved study, we evaluated 3 Google LLMs (Gemini-1.5-Pro, Gemini-1.5-Flash-001, and Gemini-1.5-Flash-002) in classifying 11 999 pulmonary CT angiography radiology reports for PE. Ground truth labels were determined by concordance between a computer vision-based PE detection (CVPED) algorithm and multiple LLM runs under various configurations. Discrepancies between the algorithms' classifications were aggregated and manually reviewed. We evaluated the effects of prompt design, data perturbations, and repeated analyses across geographic cloud regions. Performance metrics were calculated.

Results: Of 11 999 reports, 1296 (10.8%) were PE-positive. Accuracy across LLMs ranged between 0.953 and 0.996. The highest recall (up to 0.997) was achieved with a prompt modified after a review of the misclassified cases. Few-shot prompting improved recall (up to 0.99), while chain-of-thought prompting generally degraded performance. Gemini-1.5-Flash-002 demonstrated the highest robustness against data perturbations. Geographic cloud region variability was minimal for Gemini-1.5-Pro, while the Flash models showed stable performance.

Discussion and conclusion: LLMs demonstrated high performance in classifying radiology reports, though results varied with prompt design and data quality. These findings underscore the need for systematic evaluation and validation of LLMs for clinical applications, particularly in high-stakes scenarios.

Citations: 0
Ensemble learning to enhance accurate identification of patients with glaucoma using electronic health records.
IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES Pub Date : 2025-08-10 eCollection Date: 2025-08-01 DOI: 10.1093/jamiaopen/ooaf080
Tushar Mungle, Behzad Naderalvojoud, Chris A Andrews, Hong Su An, Amanda Bicket, Amy Zhang, Julie Rosenthal, Wen-Shin Lee, Chase A Ludwig, Bethlehem Mekonnen, Suzann Pershing, Joshua D Stein, Tina Hernandez-Boussard

Objectives: Existing ophthalmology studies that identify clinical phenotypes in real-world datasets (RWD) rely exclusively on structured data elements (SDE). We evaluated the performance, generalizability, and fairness of multimodal ensemble models that integrate real-world SDE and free-text data, compared to SDE-only models, to identify patients with glaucoma.

Materials and methods: This is a retrospective cross-sectional study involving 2 health systems: University of Michigan (UoM) and Stanford University (SU). It involves 1728 patients visiting eye clinics during 2012-2021. Free-text embeddings extracted using BioClinicalBERT were combined with SDE. EditedNearestNeighbor (ENN) undersampling and Borderline-Synthetic Minority Over-sampling Technique (bSMOTE) addressed class imbalance. Lasso Regression (LR), Random Forest (RF), and Support Vector Classifier (SVC) models were trained on UoM imbalanced (imb) and resampled data, along with a bagging ensemble method. Models were externally validated with SU data. Fairness was assessed using equalized odds difference (EOD) and Target Probability Difference (TPD).

Results: Among 900 and 828 patients from UoM and SU, 10% and 23%, respectively, had glaucoma as confirmed by ophthalmologists. At UoM, multimodal LR_imb (F1 = 76.60 [61.90-88.89]; AUROC = 95.41 [87.01-99.63]) outperformed unimodal RF_imb (F1 = 69.77 [52.94-83.64]; AUROC = 97.72 [95.95-99.18]) and the ICD-coding method (F1 = 53.01 [39.51-65.43]; AUROC = 90.10 [84.59-93.93]). Bagging (BM = LR_ENN + LR_bSMOTE) improved performance, achieving an F1 of 83.02 [70.59-92.86] and an AUROC of 97.59 [92.98-99.88]. During external validation, BM achieved the highest F1 (68.47 [62.61-73.75]), outperforming unimodal (F1 = 51.26 [43.80-58.13]) and multimodal LR_imb (F1 = 62.46 [55.95-68.24]). BM EOD revealed lower disparities for sex (<0.1), race (<0.5) and ethnicity (<0.5), and BM had the least uncertainty in the TPD analysis compared with traditional models.

Discussion: Multimodal ensemble models integrating structured and unstructured EHR data outperformed traditional SDE models achieving fair predictions across demographic sub-groups. Among ensemble methods, bagging demonstrated better generalizability than stacking, particularly when training data is limited.

Conclusion: This approach can enhance phenotype discovery to enable future research studies using RWD, leading to better patient management and clinical outcomes.
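The equalized odds difference (EOD) used as a fairness measure in this abstract can be sketched as follows. The abstract does not state the exact formula, so this sketch assumes one common definition: the larger of the between-group gaps in true positive rate and false positive rate. The function names and the toy data are hypothetical.

```python
def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap across groups.

    Assumed definition; the study's own computation may differ.
    """
    by_group = {}
    for t, p, g in zip(y_true, y_pred, groups):
        truths, preds = by_group.setdefault(g, ([], []))
        truths.append(t)
        preds.append(p)
    rates = [tpr_fpr(t, p) for t, p in by_group.values()]
    tprs = [r[0] for r in rates]
    fprs = [r[1] for r in rates]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Hypothetical example: group "A" is classified perfectly, group "B" is not.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
eod = equalized_odds_difference(y_true, y_pred, groups)
```

A value of 0 would mean both error rates are identical across groups; larger values indicate that at least one of the two rates differs between groups.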

Citations: 0