Journal of Biomedical Informatics: Latest Articles, Page 8

Interpretable statistical modeling of patient flow in emergency departments

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-10-14 DOI: 10.1016/j.jbi.2025.104937

Hugo Álvarez-Chaves, María D. R-Moreno

Objective:

This paper aims to develop a data-driven simulation framework for modeling patient flow in a hospital Emergency Department using interpretable methods throughout the entire process in the absence of system resource data. The goal is to improve understanding of system dynamics and support decision-making processes through transparent simulations, even when resource data are unavailable.

Methods:

We developed a simulation framework using anonymized medical records from a Spanish hospital’s Emergency Department. The model captures patient flow by triage level, identifying routes and measuring the transition times between the stages of each route. We estimated these transition times using both parametric (theoretical) distributions and non-parametric Kernel Density Estimation (KDE). Patient admission times are modeled using probability distributions. We enhanced realism through an iterative refinement process guided by tolerance thresholds and quantitative metrics, which refined the synthetic data to match the original distributions.
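As a rough illustration of the non-parametric option, the sketch below draws synthetic transition times from a Gaussian KDE fitted to observed durations. It is not the authors' implementation: the function names, the normal-reference bandwidth rule, the rejection of negative draws, and the sample durations are all assumptions made for this example.

```python
import random
import statistics


def normal_reference_bandwidth(samples):
    """Normal-reference rule of thumb, h = 1.06 * sd * n^(-1/5) (an assumed choice)."""
    n = len(samples)
    return 1.06 * statistics.stdev(samples) * n ** (-1 / 5)


def sample_from_kde(samples, n_draws, rng, bandwidth=None):
    """Draw n_draws values from a Gaussian KDE fitted to `samples`.

    Sampling from a Gaussian KDE is equivalent to picking an observed
    point at random and adding Gaussian noise scaled by the bandwidth.
    Negative draws are rejected because durations cannot be negative,
    which slightly truncates the density near zero.
    """
    h = bandwidth if bandwidth is not None else normal_reference_bandwidth(samples)
    draws = []
    while len(draws) < n_draws:
        value = rng.choice(samples) + rng.gauss(0.0, h)
        if value >= 0.0:
            draws.append(value)
    return draws


# Hypothetical triage-to-consultation times in minutes (made-up data).
observed = [12.0, 15.5, 9.0, 30.0, 22.5, 18.0, 14.0, 45.0, 11.0, 27.0]
rng = random.Random(0)
synthetic = sample_from_kde(observed, 1000, rng)
```

In a route-based simulation, one such sampler would presumably be fitted per transition and triage level. A parametric alternative would replace `sample_from_kde` with draws from a fitted theoretical distribution.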

Results:

With low tolerance values in the iterative method, our approach produces highly realistic patient flow simulations, and the process gradually converges toward the original data. Distance and divergence metrics, together with statistical test results, indicate a high degree of similarity between the simulations and the real data: when the tolerance threshold is low, 100% of the generated samples pass the Mann–Whitney U and Kolmogorov–Smirnov tests simultaneously.
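To make the similarity check concrete, here is a minimal sketch of a two-sample Kolmogorov–Smirnov comparison between real and simulated durations. It covers only the KS half of the reported criterion (the Mann–Whitney U test is omitted) and uses the standard large-sample critical value at the 0.05 level. The function names and the exponential toy data are assumptions, not details from the paper.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_passes(a, b, c_alpha=1.358):
    """True if D stays below the large-sample critical value.

    c_alpha = 1.358 corresponds to a significance level of 0.05.
    """
    n, m = len(a), len(b)
    return ks_statistic(a, b) <= c_alpha * math.sqrt((n + m) / (n * m))


# Toy data: "real" durations and a simulation whose mean is twice as long.
rng = random.Random(1)
real = [rng.expovariate(1 / 20) for _ in range(500)]
mismatched = [rng.expovariate(1 / 40) for _ in range(500)]
mismatch_detected = not ks_passes(real, mismatched)
```

Repeating `ks_passes` over many generated samples and reporting the fraction that pass would give a figure comparable in spirit to the pass rate quoted above.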

Conclusion:

The experimental results demonstrate that our simulation method effectively reproduces patient flow dynamics with a high level of realism and flexibility, even in the absence of information related to service resources. Its interpretable design and adjustable parameters enable safe data analysis and the exploration of alternative management strategies (e.g., modifying potential patient routes or restricting some transitions). These features position the methodology as a valuable tool for supporting informed decision-making and suggest its potential for use in other hospitals with suitable data, pending validation on external datasets.



Citations: 0

Discovering signature disease trajectories in pancreatic cancer and soft-tissue sarcoma from longitudinal patient records

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-10-19 DOI: 10.1016/j.jbi.2025.104935

Liwei Wang , Rui Li , Andrew Wen , Qiuhao Lu , Jinlian Wang , Xiaoyang Ruan , Adriana Gamboa , Neha Malik , Christina L. Roland , Matthew H.G. Katz , Heather Lyu , Hongfang Liu

Background

Most clinicians have limited experience with rare diseases, making diagnosis and treatment challenging. Large real-world data sources, such as electronic health records (EHRs), provide a massive amount of information that can potentially be leveraged to determine the patterns of diagnoses and treatments for rare tumors that can serve as clinical decision aids.

Objectives

We aimed to discover signature disease trajectories of 3 rare cancer types: pancreatic cancer, soft-tissue sarcoma (STS) of the trunk and extremity (STS-TE), and STS of the abdomen and retroperitoneum (STS-AR).

Materials and Methods

Leveraging the IQVIA Oncology Electronic Medical Record, we identified significant diagnosis pairs across 3 years in patients with these cancers through matched cohort sampling, statistical computation, and a right-tailed binomial hypothesis test, and then visualized trajectories of up to 3 progressions. We further systematically validated the discovered trajectories with the UTHealth Electronic Health Records (EHR).
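The right-tailed binomial test mentioned above can be sketched as follows. The abstract does not say how the null probability is estimated, so this example assumes it comes from the rate at which diagnosis B follows diagnosis A in the matched comparison cohort. The function name and all counts are hypothetical.

```python
import math


def binomial_right_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each term is computed in log space so that large n does not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Hypothetical counts: of 200 cancer patients with diagnosis A,
# 30 later receive diagnosis B; the matched cohort suggests p = 0.08.
n_with_a = 200
n_a_then_b = 30
p_null = 0.08
p_value = binomial_right_tail(n_a_then_b, n_with_a, p_null)
is_significant = p_value < 0.05
```

A pair (A, B) would be kept when the A-then-B count is significantly higher than the null rate predicts. With many candidate pairs, a multiple-testing correction would normally be applied, though the abstract does not state which one, if any, was used.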

Results

We identified 266 significant diagnosis pairs for pancreatic cancer, 130 for STS-TE, and 118 for STS-AR. We further found 44 2-hop (i.e., 2-progression) and 136 3-hop trajectories before pancreatic cancer, 36 2-hop and 37 3-hop trajectories before STS-TE, and 17 2-hop and 5 3-hop trajectories before STS-AR. We also found 54 2-hop and 129 3-hop trajectories following pancreatic cancer, 11 2-hop and 17 3-hop trajectories following STS-TE, and 5 2-hop and no 3-hop trajectories following STS-AR. For example, pain in joint and gastro-oesophageal reflux disease occurred before pancreatic cancer in 64 (0.5%) patients; pain in joint and “pain in limb, hand, foot, fingers and toes” occurred before STS-TE in 40 (0.9%) patients; and agranulocytosis secondary to cancer chemotherapy and neoplasm-related pain occurred after pancreatic cancer in 256 (1.9%) patients. Systematic validation using the UTHealth EHR confirmed the validity of the discovered trajectories.

Conclusion

We identified signature disease trajectories for the studied rare cancers by leveraging large-scale EHR data and trajectory mining approaches. These disease trajectories could serve as potential resources for clinicians to deepen their understanding of the temporal progression of conditions preceding and following these rare cancers, further informing patient-care decisions.



Citations: 0

Advancing healthcare analytics: a thematic review of machine learning, health informatics, and real-world data applications

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-10-16 DOI: 10.1016/j.jbi.2025.104934

Maria I. Arias, Lorena Cadavid, Juan D. Velásquez

Objective

To map the conceptual and methodological landscape of healthcare analytics by identifying dominant thematic clusters, synthesizing key trends, and outlining translational challenges and research opportunities in the field.

Methods

A total of 2,281 Scopus-indexed publications were analyzed using unsupervised text mining and clustering techniques. The analysis focused on identifying recurring themes, methodological innovations, and gaps within healthcare analytics literature across clinical, administrative, and public health contexts.

Results

Eight dominant themes were identified: intelligent systems for predictive healthcare, patient-centered health analytics, adaptive AI for clinical insights, demographic health analytics, digital mental health surveillance, ethical analytics for health surveillance, personalized care through data analytics, and AI-driven insights for outbreak response. These reflect a transition toward real-time, multimodal, and ethically grounded analytics ecosystems. Persistent challenges include data interoperability, algorithmic opacity, standardization of evaluation, and demographic bias.

Conclusions

The review highlights emerging priorities, including explainable AI, federated learning, and context-aware modeling, as well as ethical considerations related to data privacy and digital equity. Practical recommendations include co-designing with healthcare professionals, investing in infrastructure, and deploying real-time clinical decision support. Healthcare analytics is positioned as a foundational pillar of learning health systems with broad implications for translational research and precision health.



Citations: 0

GraphFusion: Integrative prediction of drug synergy using multi-scale graph representations and cell line contexts

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-09-30 DOI: 10.1016/j.jbi.2025.104921

Biyang Zeng, Shikui Tu, Lei Xu

Predicting the synergy of drug combinations is crucial for cancer treatment and drug development. Accurate prediction requires the integration of multiple types of data, including molecular structures of individual drugs, available synergy scores between drugs, and gene expression information from different cancer cell lines. The first two types contain multi-scale information within or between drugs, while the cell lines serve as the contextual background for drug interactions. Existing machine learning methods fail to fully utilize and integrate this information, leading to suboptimal performance. To address this issue, we introduce GraphFusion, an innovative approach that combines molecular graphs and drug synergy graphs with cell line contextual information. By employing novel GCN and Graphormer modules capable of accepting and utilizing external information, GraphFusion integrates these two levels of graph information. Specifically, the molecular graphs pass fine-grained structural information to the synergy graphs, while the synergy graphs convey global drug interaction data to the molecular graphs. Additionally, cell line information is incorporated as contextual background. This comprehensive integration enables GraphFusion to achieve state-of-the-art results on the O’Neil and NCI-ALMANAC datasets.



Citations: 0

Cross-scale semantic fusion integration of dual pathway models in drug repositioning

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-09-25 DOI: 10.1016/j.jbi.2025.104914

Mingxuan Li, Shuai Li, Zhen Li, Mandong Hu

Drug Repositioning (DR) represents an innovative drug development strategy that significantly reduces both cost and time by identifying new therapeutic indications for approved drugs. Current methods primarily focus on extracting information from drug–disease networks, but often overlook critical local structural details between nodes. This study introduces CSDPDR, a novel dual-branch graph neural network that integrates Topology Feature Information and Salient Feature Information to enhance drug repositioning accuracy and efficiency. Through the Topology-aware branch with Adaptive Residual Graph Attention and the Saliency-aware branch with Score-Driven Top-K Convolutional Graph Pooling, the model can capture both large-scale topology patterns and fine-grained local information. Furthermore, our approach effectively alleviates graph sparsity issues through meta-path-based network enhancement and confidence-based filtering mechanisms. Comparative experiments on two benchmark datasets and an additional dataset demonstrate that CSDPDR significantly outperforms several state-of-the-art baseline methods. Case studies on Alzheimer’s disease and breast neoplasms further validate the model’s practical applicability and effectiveness.
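For readers unfamiliar with score-driven top-k graph pooling, the sketch below shows the generic idea: score each node, keep the k highest-scoring nodes, and retain only the edges among them. It is a plain illustration of the general technique, not the CSDPDR module. The function name, the tanh gating, and the toy graph are assumptions, and in a trained model the scores would be learned rather than given.

```python
import math


def top_k_pool(features, edges, scores, k):
    """Keep the k highest-scoring nodes and the edges among them.

    Returns (pooled_features, pooled_edges, kept), where kept lists the
    original node indices in ascending order and pooled nodes are
    re-indexed 0..k-1 in that order.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    new_index = {old: new for new, old in enumerate(kept)}
    # Gate each kept feature vector by its squashed score (tanh is an assumed choice).
    pooled_features = [[math.tanh(scores[i]) * x for x in features[i]] for i in kept]
    pooled_edges = [
        (new_index[u], new_index[v])
        for u, v in edges
        if u in new_index and v in new_index
    ]
    return pooled_features, pooled_edges, kept


# Toy graph with 5 nodes, 2-dimensional features, and given scores.
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [0.2, 0.8]]
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (2, 3)]
scores = [0.1, 0.9, 0.4, 0.8, 0.2]
pooled_features, pooled_edges, kept = top_k_pool(features, edges, scores, k=3)
```

Here nodes 1, 2, and 3 survive, and only the three edges among them are carried into the coarsened graph.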



Citations: 0

LFVDNet: Low-frequency variable-driven network for medical time series

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-09-23 DOI: 10.1016/j.jbi.2025.104913

Yue Zhang , Dengqun Sun , Lei Li , Jian Zhou , Xiuquan Du , Shuo Li

Objective:

Medical time series, a type of multivariate time series with missing values, are widely used in predictive analysis, and an “impute first, then predict” end-to-end architecture is commonly used to handle the missing values. However, existing methods tend to lose the uniqueness and key information of low-frequency sampled variables (LFSVs). In this paper, we aim to develop a method that effectively handles LFSVs, preserving their distinctive characteristics and essential information throughout the modeling process.

Methods:

We propose a novel end-to-end method named Low-Frequency Variable-Driven network (LFVDNet) for medical time series analysis. Specifically, the Time-Aware Imputer (TA) module encodes the observed values and critical time information, and uses an attention mechanism to establish an association between the observed values and the missing values. TA adopts a channel-independent strategy to prevent high-frequency sampled variables (HFSVs) from interfering with LFSVs, thereby preserving the unique information contained in LFSVs. The Offset-Selection (OS) module independently selects data points for each variable through offsets, avoiding the inherent disadvantage of LFSVs in selection-based imputation and thus preventing the loss of their key information. LFVDNet is the first method for analyzing multivariate time series with missing values that emphasizes the effective utilization of LFSVs.
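The channel-independent, time-aware idea can be illustrated with a much simpler stand-in: each missing value is estimated from observations of the same variable only, with softmax weights that decay as the time gap grows. This is not the LFVDNet architecture, which learns its attention weights. The fixed distance kernel, the `tau` parameter, the function names, and the toy record are all assumptions; the authors' actual code is in the repository linked in the Results section.

```python
import math


def impute_channel(times, values, query_time, tau=1.0):
    """Softmax-weighted mean of one variable's observations.

    Weights decay with the absolute time distance to query_time;
    tau controls how quickly distant observations are discounted.
    """
    logits = [-abs(query_time - t) / tau for t in times]
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]  # subtract max for stability
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def impute_series(series, grid, tau=1.0):
    """Fill every variable on a common time grid, one channel at a time.

    series maps a variable name to a list of (time, value) observations.
    A variable is imputed from its own observations only, so densely
    sampled variables cannot dominate sparsely sampled ones.
    """
    filled = {}
    for name, observations in series.items():
        times = [t for t, _ in observations]
        values = [v for _, v in observations]
        observed = dict(observations)
        filled[name] = [
            observed[t] if t in observed else impute_channel(times, values, t, tau)
            for t in grid
        ]
    return filled


# Toy record: heart rate sampled hourly, lactate sampled only twice.
series = {
    "heart_rate": [(0, 80.0), (1, 82.0), (2, 85.0), (3, 90.0), (4, 88.0), (5, 86.0), (6, 84.0)],
    "lactate": [(0, 1.0), (6, 3.0)],
}
filled = impute_series(series, grid=range(7), tau=2.0)
```

Observed values are passed through unchanged, and the lactate estimate at hour 3 sits midway between its two measurements because both are equally distant in time.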

Results:

We carried out experiments on four public datasets, and the results indicate that LFVDNet has better robustness and performance. All code is available at https://github.com/dxqllp/LFVDNet.

Conclusions:

This study proposes a novel method for medical time series analysis, namely LFVDNet, which aims to effectively utilize LFSVs. Specifically, we have designed the TA module, which performs imputation through temporal correlations. The OS module, on the other hand, performs selective imputation based on a data point selection strategy. We have verified the effectiveness of this method on four datasets constructed from PhysioNet 2012 and MIMIC-IV.



Citations: 0

Towards a Biological Evaluation Framework for Oversampling (BEFO) gene expression data

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-10-17 DOI: 10.1016/j.jbi.2025.104932

Kevin Fee , Suneil Jain , Ross G. Murphy , Anna Jurek-Loughrey

Machine learning (ML) techniques are increasingly used in biomedical research, where they serve as decision support systems that help clinicians improve diagnostic and prognostic accuracy. However, many biomedical datasets suffer from severe class imbalance due to small population sizes, which biases machine learning models toward majority-class samples. Current oversampling methods focus primarily on balancing datasets without adequately validating the biological relevance of the synthetic data, which puts the clinical applicability of downstream model predictions at risk. To address these shortcomings, we propose the Biological Evaluation Framework for Oversampling (BEFO), designed to ensure that synthetic gene expression samples accurately reflect the biological patterns present in the original datasets. This innovation not only mitigates bias but also enhances the trustworthiness of predictive models in clinical scenarios. Within this framework, we developed a ranking method for synthetic samples and evaluated each sample's inclusion based on its rank. The ranking method first computes WGCNA (weighted gene co-expression network analysis) clusters on the original dataset. Several random forests are then constructed to assess the alignment of each synthetic sample with each cluster. Only synthetic samples ranked as more important than real samples are included in a study. The experimental results demonstrate that our proposed ML oversampling framework can improve the biological feasibility of oversampled datasets by an average of 11%, leading to an average improvement of 9% in classification performance when compared against five state-of-the-art (SOTA) oversampling methods and ten classification algorithms across six real-world gene expression datasets. The framework thereby establishes a new standard for synthetic data evaluation in biomedical ML applications.
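To make concrete what kind of synthetic sample such a framework has to judge, the following is a minimal sketch of SMOTE-style interpolation between minority-class samples. It is not the BEFO method and not the authors' code: it simplifies SMOTE by picking a random partner instead of a k-nearest neighbour, and the function names and expression values are hypothetical.

```python
import random


def interpolate_sample(x, neighbor, rng):
    """Return a synthetic point on the line segment between x and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def oversample_minority(minority, n_new, seed=0):
    """Create n_new synthetic samples from random pairs of minority samples.

    Simplification: real SMOTE interpolates toward one of the k nearest
    neighbours; here the partner is any other minority sample.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x, neighbor = rng.sample(minority, 2)  # two distinct samples
        synthetic.append(interpolate_sample(x, neighbor, rng))
    return synthetic


# Hypothetical expression values: three minority-class samples, two genes each.
minority = [[1.0, 5.0], [2.0, 6.0], [3.0, 4.0]]
synthetic = oversample_minority(minority, n_new=4)
for sample in synthetic:
    print(sample)
```

Every interpolated value stays within the per-gene range of the minority class, yet nothing in the procedure checks whether the gene-to-gene relationships remain plausible. That gap is what a biological evaluation step is meant to address.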

Citations: 0

A LangChain-based pipeline for one-shot synthetic text generation using generative pre-trained transformers in palliative care research

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-10-13 DOI: 10.1016/j.jbi.2025.104936

Isabel Ronan , Patrice Crowley , Eva Rombouts , Nicola Cornally , Mohamad M. Saab , David Murphy , Sabin Tabirca

Objective:

As the world’s population ages, nursing homes are of increasing importance. In order to care for a growing number of older adults, intelligent technologies are needed. Artificial Intelligence can be utilised to enhance palliative care in nursing homes. However, the data needed to train artificially intelligent agents is lacking within this sensitive domain due to privacy issues. Therefore, it is difficult for researchers to develop technological solutions. With the advent of large language models, such as ChatGPT, new text generation methods are made possible using limited data. In this pilot study, we investigate the use of large language models to generate synthetic data.

Methods:

We investigate the feasibility of using GPT-3.5 and GPT-4o models along with one-shot prompting to produce synthetic nurse notes which faithfully describe nursing home residents with met or unmet palliative care needs. We used LangChain to create a repeatable pipeline which can be adapted to different use-cases. We also compare the performance of both models using a set of qualitative and quantitative evaluations to determine which set of notes is more suitable for subsequent research.
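One-shot prompting means the model sees exactly one worked example before being asked for a new output. The sketch below assembles such a prompt with plain string formatting. It deliberately does not use LangChain, and the template wording, the function name `build_one_shot_prompt`, and the example note are all hypothetical placeholders rather than the prompts used in the study.

```python
ONE_SHOT_TEMPLATE = (
    "You are writing a synthetic nurse note for a nursing home resident.\n"
    "Example note ({example_label} palliative care need):\n"
    "{example_note}\n\n"
    "Write one new note describing a resident with a {target_label} "
    "palliative care need.\n"
)

ALLOWED_LABELS = {"met", "unmet"}


def build_one_shot_prompt(example_note, example_label, target_label):
    """Fill the template with a single example note and the requested label."""
    if example_label not in ALLOWED_LABELS or target_label not in ALLOWED_LABELS:
        raise ValueError("labels must be 'met' or 'unmet'")
    return ONE_SHOT_TEMPLATE.format(
        example_note=example_note,
        example_label=example_label,
        target_label=target_label,
    )


# Hypothetical example note, written for illustration only.
prompt = build_one_shot_prompt(
    example_note="Resident reports pain is well controlled.",
    example_label="met",
    target_label="unmet",
)
print(prompt)
```

Keeping the template separate from the values that fill it is what makes a pipeline of this kind repeatable: the same template can be reused with a different example note or target label.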

Results:

GPT-3.5 performed slightly better than GPT-4o in our qualitative analysis by healthcare professionals. Quantitative analysis revealed appropriately heterogeneous results across contextual similarity, lexical overlap, sentiment, and readability scores.
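As an illustration of what a lexical overlap score measures, the sketch below computes Jaccard similarity over word sets. The abstract does not state which overlap metric the study used, so Jaccard is an assumed stand-in here, and the two notes are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_overlap(a, b):
    """Shared tokens divided by all distinct tokens across both texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical notes: 3 shared tokens out of 5 distinct tokens.
note_a = "Resident slept well overnight"
note_b = "Resident slept poorly overnight"
print(jaccard_overlap(note_a, note_b))  # 0.6
```

A score near 1 across many generated notes would suggest the model is copying the example, while a score near 0 would suggest drift away from the target register, which is why heterogeneous scores can be the desirable outcome.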

Conclusion:

Our work is the first investigation of such a generation method in the nursing home palliative care domain. Further refinement and validation of such data is needed in order to ensure the safe use of our approach.

Citations: 0

A comprehensive evaluation framework for synthetic medical tabular data generation

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-10-14 DOI: 10.1016/j.jbi.2025.104939

Anastasia Kurakova, Hajar Homayouni

Machine learning (ML) applications have enabled significant advancements in healthcare, such as predicting pandemics, personalizing treatments, and developing life-saving drugs. However, ML model training requires large datasets, which are difficult to obtain in healthcare due to privacy concerns. Synthetic data generation offers a promising solution by providing access to large-scale training data while protecting patient privacy. Our research focuses on tabular medical data, the predominant format for Electronic Health Records (EHRs), and introduces a comprehensive evaluation framework that assesses synthetic data in four critical dimensions: quality, privacy, usability, and computational complexity of the data generation process. The framework ensures that synthetic data maintains sufficient similarity to real data for ML applications while preserving patient confidentiality. To validate our approach, we applied six state-of-the-art (SOTA) generative models to generate synthetic medical datasets and evaluated them within our framework. In contrast to conventional approaches that focus primarily on statistical similarity, our framework provides a broader assessment that incorporates outlier detection, privacy risks, and domain-specific constraints. Our findings demonstrate that our framework can identify critical shortcomings in synthetic data generation models, such as the amplification of duplicate rows and the generation of out-of-range values, which are overlooked by traditional statistical evaluation methods. Our implementation of the framework is available at: https://github.com/akurakova/SDE_Framework
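The two shortcomings named above, amplified duplicate rows and out-of-range values, can be illustrated with a few lines of standard-library Python. This is a minimal sketch and not the authors' implementation, which lives in the linked repository. The function names are hypothetical, the toy rows are made up, and "out of range" is assumed to mean outside the per-column minimum and maximum of the real data.

```python
from collections import Counter


def duplicate_rate(rows):
    """Fraction of rows that repeat an earlier row."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row) for row in rows)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(rows)


def out_of_range_count(real_rows, synthetic_rows):
    """Count synthetic values outside the per-column [min, max] of the real data."""
    bounds = [(min(col), max(col)) for col in zip(*real_rows)]
    return sum(
        1
        for row in synthetic_rows
        for value, (low, high) in zip(row, bounds)
        if value < low or value > high
    )


# Hypothetical (age, systolic blood pressure) rows.
real = [(34, 120), (51, 135), (67, 142), (51, 135)]
synthetic = [(40, 125), (40, 125), (40, 125), (-3, 130)]

print(duplicate_rate(real))                 # 0.25
print(duplicate_rate(synthetic))            # 0.5
print(out_of_range_count(real, synthetic))  # 1
```

In this toy case the synthetic table doubles the duplicate rate and contains a negative age, two defects that a comparison of column means alone could easily miss.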

Citations: 0

Review of tools to support Target Trial Emulation

IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-11-01 Epub Date: 2025-09-26 DOI: 10.1016/j.jbi.2025.104897

Christina A. van Hal , Elmer V. Bernstam , Todd R. Johnson

Objective:

Randomized Controlled Trials (RCTs) are the gold standard for clinical evidence, but ethical and practical constraints sometimes necessitate or warrant the use of observational data. The aim of this study is to identify informatics tools that support the design and conduct of Target Trial Emulations (TTEs), a framework for designing observational studies that closely emulate RCTs so as to minimize biases that often arise when using real-world evidence (RWE) to estimate causal effects.

Methods:

We divided the process of conducting TTEs into three phases and seven steps. We then systematically reviewed the literature to identify currently available tools that support one or more of the seven steps required to conduct a TTE. For each tool, we noted which step or steps the tool supports.

Results:

The initial review covered 7625 papers, of which 76 met our inclusion criteria. Our review identified 24 distinct tools applicable to the three phases of TTE. Specifically, 3 tools support the Design Phase, 5 support the Implementation Phase, and 19 support the Analysis Phase; because some tools apply to multiple phases, the per-phase counts sum to more than 24.
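The per-phase counts (3 + 5 + 19 = 27) exceed the 24 distinct tools because a tool that supports several phases is counted once per phase. The sketch below shows that tally on a toy mapping. The tool names and phase assignments are hypothetical and are not taken from the review.

```python
from collections import Counter

# Hypothetical tool-to-phase mapping, for illustration only.
tools = {
    "tool_a": {"Design"},
    "tool_b": {"Implementation", "Analysis"},
    "tool_c": {"Analysis"},
    "tool_d": {"Design", "Analysis"},
}

# Count each tool once for every phase it supports.
phase_counts = Counter(phase for phases in tools.values() for phase in phases)
distinct_tools = len(tools)
total_memberships = sum(phase_counts.values())

print(dict(phase_counts))
print(distinct_tools, total_memberships)  # 4 6
```

The difference between the two totals (here 6 - 4 = 2, in the review 27 - 24 = 3) is the number of extra phase memberships contributed by multi-phase tools.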

Conclusion:

This review revealed significant gaps in tool support for the Design Phase of TTEs, while support for the Implementation and Analysis phases was highly variable. No single tool currently supports all aspects of TTEs from start to finish and few tools are interoperable, meaning they cannot be easily integrated into a unified workflow. The results highlight the need for further development of informatics tools for supporting TTEs.

Citations: 0