
Journal of Biomedical Informatics: Latest Publications

Pseudo-labeling and knowledge-guided contrastive learning for radiology report generation
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-22 | DOI: 10.1016/j.jbi.2025.104941
Fan Ye, Xuan Hu, Yihao Ding, Feifei Liu

Objective:

Radiology report generation (RRG) is a transformative technology in the field of radiology imaging that aims to address the critical need for consistency and comprehensiveness in diagnostic interpretation. Although recent advances in graph-based representation learning have demonstrated excellent performance in disease progression modeling, their application in radiology report generation still suffers from three inherent limitations: (i) semantic separation between local image features and free-text descriptions, (ii) inherent noise in automated medical concept annotation, and (iii) lack of anatomical constraints in cross-modal attention mechanisms.

Method:

This study proposes a pseudo-labeling and knowledge-guided contrastive learning (PKCL) framework, which addresses the above issues through a novel fusion of dynamic query learning and knowledge-guided contrastive learning. The PKCL framework employs a trainable cross-modal query matrix (QM) to learn shared representations through parameter-sharing self-attention mechanisms between the imaging and text encoders. During training, the QM queries disease-related visual regions referenced in reports, enabling dynamic alignment between radiological features and textual descriptions during both training and inference. Additionally, the method combines pseudo-labels with an adaptive top-k weighted feature fusion strategy to strengthen learning from standard comparisons, and it leverages pre-built knowledge graphs via the XRayVision (Cohen et al., 2022) model to account for disease relationships and anatomical dependencies, thereby improving the clinical accuracy of generated reports.
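As an illustration of the adaptive top-k weighted feature fusion described above, the following minimal sketch (a simplified stand-in, not the authors' implementation; the query matrix, feature dimensions, and softmax temperature are toy assumptions) selects the k visual region features most similar to each learned query and fuses them by similarity-weighted averaging.

```python
import numpy as np

def topk_weighted_fusion(queries, region_feats, k=3, temperature=0.1):
    """Fuse the top-k most similar region features for each query vector.

    queries:      (num_queries, d) learned cross-modal query matrix (QM).
    region_feats: (num_regions, d) visual features of image regions.
    Returns:      (num_queries, d) fused disease-related representations.
    """
    # Cosine similarity between every query and every region feature.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    sim = q @ r.T                                   # (num_queries, num_regions)

    fused = np.zeros_like(queries)
    for i, row in enumerate(sim):
        top_idx = np.argsort(row)[-k:]              # indices of the top-k regions
        weights = np.exp(row[top_idx] / temperature)
        weights /= weights.sum()                    # softmax over the top-k scores
        fused[i] = weights @ region_feats[top_idx]  # similarity-weighted average
    return fused

# Toy example: 4 queries attending over 10 candidate region features.
rng = np.random.default_rng(0)
fused = topk_weighted_fusion(rng.normal(size=(4, 64)), rng.normal(size=(10, 64)))
print(fused.shape)  # (4, 64)
```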

Results:

Comprehensive evaluations on the IU-Xray and MIMIC-CXR datasets demonstrate that PKCL achieves state-of-the-art performance on both natural language generation metrics and clinical efficacy metrics. Specifically, it obtains 0.499 BLEU-1 and 0.374 RL on IU-Xray, and 0.346 BLEU-1 and 0.277 RL on MIMIC-CXR, outperforming prior methods such as R2GEN and CMCL.
Furthermore, PKCL exhibited robust generalization on the out-of-domain Montgomery County X-ray Set, effectively handling its low-resource conditions and brief, diagnostic-level textual supervision.

Conclusion:

The framework’s ability to maintain semantic consistency when generating clinically relevant reports represents a significant advancement over existing methods, particularly in capturing the subtle relationships between radiological findings and their textual descriptions.
Citations: 0
Discovering signature disease trajectories in pancreatic cancer and soft-tissue sarcoma from longitudinal patient records
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-19 | DOI: 10.1016/j.jbi.2025.104935
Liwei Wang, Rui Li, Andrew Wen, Qiuhao Lu, Jinlian Wang, Xiaoyang Ruan, Adriana Gamboa, Neha Malik, Christina L. Roland, Matthew H.G. Katz, Heather Lyu, Hongfang Liu

Background

Most clinicians have limited experience with rare diseases, making diagnosis and treatment challenging. Large real-world data sources, such as electronic health records (EHRs), provide massive amounts of information that can potentially be leveraged to determine patterns of diagnoses and treatments for rare tumors, which in turn can serve as clinical decision aids.

Objectives

We aimed to discover signature disease trajectories of 3 rare cancer types: pancreatic cancer, soft-tissue sarcoma (STS) of the trunk and extremity (STS-TE), and STS of the abdomen and retroperitoneum (STS-AR).

Materials and Methods

Leveraging the IQVIA Oncology Electronic Medical Record, we identified significant diagnosis pairs across 3 years in patients with these cancers through matched cohort sampling, statistical computation, and right-tailed binomial hypothesis testing, and then visualized trajectories of up to 3 progressions. We further conducted systematic validation of the discovered trajectories against the UTHealth Electronic Health Records (EHR).
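The right-tailed binomial hypothesis test for a single candidate diagnosis pair can be sketched as below (a minimal illustration, not the authors' pipeline; the counts and the matched-cohort baseline rate are hypothetical placeholders).

```python
from scipy.stats import binomtest

# Hypothetical counts: among n cancer patients, k exhibited diagnosis A followed
# by diagnosis B within the 3-year window; p0 is the rate of the same ordered
# pair observed in the matched comparison cohort.
k, n, p0 = 64, 12000, 0.002

# Right-tailed test: is the pair significantly over-represented in cases?
result = binomtest(k, n, p0, alternative="greater")
print(f"observed rate = {k / n:.4f}, p-value = {result.pvalue:.3g}")
if result.pvalue < 0.05:
    print("Diagnosis pair flagged as a significant trajectory edge.")
```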

Results

Results included 266 significant diagnosis pairs for pancreatic cancer, 130 for STS-TE, and 118 for STS-AR. We further found 44 2-hop (i.e., 2-progression) and 136 3-hop trajectories before pancreatic cancer, 36 2-hop and 37 3-hop trajectories before STS-TE, and 17 2-hop and 5 3-hop trajectories before STS-AR. Meanwhile, we found 54 2-hop and 129 3-hop trajectories following pancreatic cancer, 11 2-hop and 17 3-hop trajectories following STS-TE, and 5 2-hop and 0 3-hop trajectories following STS-AR. For example, pain in joint and gastro-oesophageal reflux disease occurred before pancreatic cancer in 64 (0.5%) patients; pain in joint and “pain in limb, hand, foot, fingers and toes” occurred before STS-TE in 40 (0.9%) patients; and agranulocytosis secondary to cancer chemotherapy and neoplasm-related pain occurred after pancreatic cancer in 256 (1.9%) patients. Systematic validation using the UTHealth EHR confirmed the validity of the discovered trajectories.

Conclusion

We identified signature disease trajectories for the studied rare cancers by leveraging large-scale EHR data and trajectory mining approaches. These disease trajectories could serve as potential resources for clinicians to deepen their understanding of the temporal progression of conditions preceding and following these rare cancers, further informing patient-care decisions.
Citations: 0
A non-interactive Online Medical Pre-Diagnosis system on encrypted vertically partitioned data
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-17 | DOI: 10.1016/j.jbi.2025.104940
Min Tang, Yuhao Zhang, Ronghua Liang, Guoqiang Deng

Objective:

In medical environments, patient records are stored as heterogeneous features across various institutions, and raw data sharing is prohibited by legal or institutional constraints. This fragmentation presents challenges for Online Medical Pre-Diagnosis (OMPD) systems. Existing methods (such as federated learning) require multiple rounds of interaction among all participating parties (hospitals and cloud servers), resulting in frequent communication. Moreover, because global gradients are shared, these methods are vulnerable to inference attacks, leading to information leakage. In this paper, we propose a secure and efficient OMPD system framework to address the problem of vertical data fragmentation, aiming to resolve the contradiction between medical data isolation and model collaboration.

Methods:

We propose PPNLR, a secure framework for building OMPD systems. The framework combines functional encryption and blinding factors to design a sample-feature dimension encryption algorithm and a privacy-preserving vectorized training algorithm. Decoupling sample computation from model training enables cross-client data aggregation with only a single round of communication between hospitals and cloud servers.
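The abstract does not disclose the concrete cryptographic construction, so the sketch below only illustrates the general idea of single-round aggregation with additive blinding factors that cancel at the server; it is a simplified stand-in, not the paper's functional-encryption scheme, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

def blind_share(local_features, pair_masks):
    """Each hospital adds its pairwise blinding factors before a single upload."""
    return local_features + sum(pair_masks)

# Three hospitals hold vertically partitioned feature contributions (toy vectors).
contributions = [rng.normal(size=5) for _ in range(3)]

# Pairwise masks: hospital i adds m_ij and hospital j subtracts it, so the masks
# cancel in the aggregate and the server never sees any raw contribution.
m01, m02, m12 = (rng.normal(size=5) for _ in range(3))
masks = [[m01, m02], [-m01, m12], [-m02, -m12]]

uploads = [blind_share(c, m) for c, m in zip(contributions, masks)]
aggregate = np.sum(uploads, axis=0)          # what the cloud server computes

assert np.allclose(aggregate, np.sum(contributions, axis=0))
print("Aggregate recovered without exposing any single hospital's share.")
```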

Results:

Security analysis shows that PPNLR is resistant to semi-honest inference attacks and collusion attacks. Evaluation results based on six real-world medical datasets (text and images) show that: (i) The inference accuracy is close to that of the centralized plaintext training benchmark; (ii) The computational efficiency is at least 3.6× higher than that of comparable approaches; (iii) The communication complexity is significantly reduced by eliminating dependencies on iteration count.

Conclusion:

PPNLR achieves data protection through cryptographic primitives, maintaining high diagnostic accuracy while ensuring the security of medical data and model parameters. Its single-communication architecture significantly reduces the deployment threshold in resource-constrained scenarios, providing a practical framework for building privacy-friendly OMPD systems.
Citations: 0
Synthetic-to-real attentive deep learning for Alzheimer’s assessment: A domain-agnostic framework for ROCF scoring
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-17 | DOI: 10.1016/j.jbi.2025.104929
Kassem Anis Bouali, Elena Šikudová

Objective:

Early diagnosis of Alzheimer’s disease depends on accessible cognitive assessments, such as the Rey-Osterrieth Complex Figure (ROCF) test. However, manual scoring of this test is labor-intensive and subjective, which introduces experimental biases. Additionally, deep learning models face challenges due to the limited availability of annotated clinical data, particularly for assessments like the ROCF test. This scarcity of data restricts model generalization and exacerbates domain shifts across different populations.

Methods:

We propose a novel framework comprising a data synthesis pipeline and ROCF-Net, a deep learning model specifically designed for ROCF scoring. The synthesis pipeline is lightweight and capable of generating realistic, diverse, and annotated ROCF drawings. ROCF-Net, on the other hand, is a cross-domain scoring model engineered to address domain discrepancies in stroke texture and line artifacts. It maintains high scoring accuracy through a novel line-specific attention mechanism tailored to the unique characteristics of ROCF drawings.
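A minimal sketch of attention-weighted pooling over per-stroke features, in the spirit of the line-specific attention mechanism described above (a simplified stand-in, not ROCF-Net itself; the stroke features, dimensions, and scoring vector are toy assumptions).

```python
import numpy as np

def line_attention_pool(stroke_feats, w_score):
    """Pool per-stroke features into a drawing-level vector with attention.

    stroke_feats: (num_strokes, d) features of individual drawn lines.
    w_score:      (d,) scoring vector that rates the relevance of each stroke.
    """
    logits = stroke_feats @ w_score                 # one relevance score per stroke
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                        # softmax attention weights
    pooled = weights @ stroke_feats                 # attention-weighted pooling
    return pooled, weights

rng = np.random.default_rng(1)
pooled, weights = line_attention_pool(rng.normal(size=(18, 32)), rng.normal(size=32))
print(pooled.shape, weights.round(3)[:5])           # (32,) plus example stroke weights
```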

Results:

Unlike conventional synthetic medical imaging methods, our approach generates ROCF drawings that accurately reflect Alzheimer’s-specific abnormalities with minimal computational cost. Our scoring model achieves SOTA performance across differently sourced datasets, with a Mean Absolute Error (MAE) of 3.53 and a Pearson Correlation Coefficient (PCC) of 0.86. This demonstrates both high predictive accuracy and computational efficiency, outperforming existing ROCF scoring methods that rely on Convolutional Neural Networks (CNNs) while avoiding the overhead of parameter-heavy transformer models. We also show that training on our synthetic data generalizes as well as training on real clinical data, where the difference in performance was minimal (MAE differed by 1.43 and PCC by 0.07), indicating no statistically significant performance gap.

Conclusion:

Our work introduces four contributions: (1) a cost-effective pipeline for generating synthetic ROCF data, reducing dependency on clinical datasets; (2) a domain-agnostic model for automated ROCF scoring across diverse drawing styles; (3) a lightweight attention mechanism aligning model decisions with clinical scoring for transparency; and (4) a bias-aware framework using synthetic data to reduce demographic disparities, promoting fair cognitive assessment across populations.
Citations: 0
Towards a Biological Evaluation Framework for Oversampling (BEFO) gene expression data
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-17 | DOI: 10.1016/j.jbi.2025.104932
Kevin Fee, Suneil Jain, Ross G. Murphy, Anna Jurek-Loughrey
Machine learning (ML) techniques are increasingly used in biomedical research to improve diagnostic and prognostic accuracy when deployed alongside clinicians as decision support systems. However, many datasets used in biomedical research suffer from severe class imbalance due to small population sizes, which biases machine learning models toward majority-class samples. Current oversampling methods primarily focus on balancing datasets without adequately validating the biological relevance of synthetic data, risking the clinical applicability of downstream model predictions. To address these shortcomings, we propose the Biological Evaluation Framework for Oversampling (BEFO), designed to ensure that synthetic gene expression samples accurately reflect the biological patterns present in the original datasets. This innovation not only mitigates bias but also enhances the trustworthiness of predictive models in clinical scenarios. Building on this framework, we developed a ranking method for synthetic samples and evaluated each sample’s inclusion based on its rank. The ranking method computes WGCNA gene co-expression clusters on the original dataset, and several random forests are constructed to assess the alignment of each synthetic sample with each cluster. Only synthetic samples deemed more important than real samples are included in a study. The experimental results demonstrate that the proposed ML oversampling framework improves the biological feasibility of oversampled datasets by an average of 11% and improves classification performance by an average of 9% when compared against five state-of-the-art (SOTA) oversampling methods and ten classification algorithms across six real-world gene expression datasets, thereby establishing a new standard for synthetic data evaluation in biomedical ML applications.
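A simplified proxy for the cluster-wise random-forest ranking described above (not the authors' implementation: KMeans on gene–gene correlations stands in for WGCNA, and the held-out probability that a real-vs-synthetic classifier assigns to "real" is used as the per-cluster alignment score; all data are toy placeholders).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
real = rng.normal(size=(80, 200))                  # 80 real samples x 200 genes
synth = real[rng.integers(0, 80, 40)] + rng.normal(scale=0.3, size=(40, 200))

# Stand-in for WGCNA: cluster genes by their co-expression profiles.
gene_clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
    np.corrcoef(real.T))

scores = np.zeros(len(synth))
for c in np.unique(gene_clusters):
    cols = gene_clusters == c
    X = np.vstack([real[:, cols], synth[:, cols]])
    y = np.r_[np.ones(len(real)), np.zeros(len(synth))]   # 1 = real, 0 = synthetic
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    # Held-out probability that each sample looks "real" under this gene cluster.
    proba = cross_val_predict(rf, X, y, cv=5, method="predict_proba")
    scores += proba[len(real):, 1]                 # accumulate over clusters

ranking = np.argsort(-scores)                      # best-aligned synthetic samples first
print("Top-ranked synthetic samples:", ranking[:10])
```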
Citations: 0
TriMedPrompt: A unified prompting framework for realistic and layout-conformant clinical progress note synthesis
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-17 | DOI: 10.1016/j.jbi.2025.104927
Garapati Keerthana, Manik Gupta
Clinical progress notes are critical artifacts for modeling patient trajectories, auditing clinical decision-making, and powering downstream applications in clinical natural language processing (NLP). However, public resources such as MIMIC-III provide limited progress notes, constraining the development of robust and generalizable machine learning models. This work proposes a novel hybrid prompting framework — TriMedPrompt — to generate high-quality, structurally and semantically coherent synthetic progress notes using large language models (LLMs). Our approach conditions the LLMs on a triad of complementary biomedical signals: (1) real-world progress notes from MIMIC-III, (2) clinically aligned case reports from the PMC Patients dataset, selected via embedding-based retrieval, and (3) structured disease-centric knowledge from PrimeKG. We design a multi-source, layout-aware prompting pipeline that dynamically integrates structured and unstructured information to produce notes across standard clinical formats (e.g., SOAP, BIRP, PIE, DAP).
Through rigorous evaluations, including layout adherence, entity extraction comparisons, semantic similarity analysis, and controlled ablations, we demonstrate that our generated notes achieve a 98.6% semantic entity alignment score with real clinical notes while maintaining high structural fidelity. Ablation studies further confirm the critical role of combining structured biomedical knowledge and unstructured narrative data in improving note quality. In addition, we illustrate the potential of our synthetic notes for privacy-preserving clinical NLP, offering a safe alternative for model development and benchmarking in sensitive healthcare settings. This work establishes a scalable, controllable paradigm for clinical text synthesis, significantly expanding access to realistic, diverse progress notes and laying the foundation for advancing trustworthy clinical NLP research.
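A minimal sketch of the kind of multi-source, layout-aware prompt assembly described above (illustrative only: TF-IDF similarity stands in for the embedding-based retrieval, and the seed note, case reports, knowledge-graph facts, and SOAP template are hypothetical placeholders, not the paper's data or prompts).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pools standing in for MIMIC-III notes, PMC case reports, and PrimeKG facts.
seed_note = "Pt with CHF exacerbation, dyspnea improving on IV diuresis."
case_reports = [
    "A 67-year-old with decompensated heart failure treated with furosemide.",
    "A 45-year-old with community-acquired pneumonia on ceftriaxone.",
]
kg_facts = {"heart failure": ["associated_with: reduced ejection fraction",
                              "treated_with: loop diuretics"]}
soap_template = "S: {s}\nO: {o}\nA: {a}\nP: {p}"

# Retrieval of the most clinically aligned case report (embedding stand-in).
vec = TfidfVectorizer().fit([seed_note] + case_reports)
sims = cosine_similarity(vec.transform([seed_note]), vec.transform(case_reports))[0]
best_report = case_reports[sims.argmax()]

prompt = (
    "You are generating a synthetic clinical progress note in SOAP format.\n"
    f"Real note for style reference:\n{seed_note}\n\n"
    f"Clinically aligned case report:\n{best_report}\n\n"
    f"Structured disease knowledge:\n{kg_facts['heart failure']}\n\n"
    f"Follow this layout exactly:\n{soap_template}"
)
print(prompt)
```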
Citations: 0
Integrated analysis for electronic health records with structured and sporadic missingness
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-16 | DOI: 10.1016/j.jbi.2025.104933
Jianbin Tan, Yan Zhang, Chuan Hong, T. Tony Cai, Tianxi Cai, Anru R. Zhang

Objectives:

We propose a novel imputation method tailored for Electronic Health Records (EHRs) with structured and sporadic missingness. Such missingness frequently arises in the integration of heterogeneous EHR datasets for downstream clinical applications. By addressing these gaps, our method provides a practical solution for integrated analysis, enhancing data utility and advancing the understanding of population health.

Materials and Methods:

We begin by demonstrating the structured and sporadic missingness mechanisms that arise in the integrated analysis of EHR data. We then introduce a novel imputation framework, Macomss, specifically designed to handle structurally and heterogeneously occurring missing data. We establish theoretical guarantees for Macomss, ensuring its robustness in preserving the integrity and reliability of integrated analyses. To assess its empirical performance, we conduct extensive simulation studies that replicate the complex missingness patterns observed in real-world EHR systems, complemented by validation using EHR datasets from the Duke University Health System (DUHS).
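The sketch below only illustrates the two missingness patterns targeted above and applies a generic off-the-shelf imputer as a baseline; Macomss itself is not reproduced here, and the site assignment and block structure are assumptions for the toy data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))                      # pooled EHR feature matrix (toy)

# Structured missingness: one contributing site never records features 8-11.
site = rng.integers(0, 3, size=len(X))
X_obs = X.copy()
X_obs[site == 2, 8:] = np.nan

# Sporadic missingness: individual entries missing at random.
sporadic = rng.random(X.shape) < 0.05
X_obs[sporadic] = np.nan

# Generic baseline imputer (a stand-in, not Macomss).
X_hat = IterativeImputer(random_state=0).fit_transform(X_obs)
mask = np.isnan(X_obs)
print("RMSE on imputed entries:", float(np.sqrt(((X_hat - X)[mask] ** 2).mean())))
```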

Results:

Simulation studies show that our approach consistently outperforms existing imputation methods. Using datasets from three hospitals within DUHS, Macomss achieves the lowest imputation errors for missing data in most cases and provides superior or comparable downstream prediction performance compared to benchmark methods.

Discussion:

The proposed method effectively addresses critical missingness patterns that arise in the integrated analysis of EHR datasets, enhancing the robustness and generalizability of clinical predictions.

Conclusions:

We provide a theoretically guaranteed and practically meaningful method for imputing structured and sporadic missing data, enabling accurate and reliable integrated analysis across multiple EHR datasets. The proposed approach holds significant potential for advancing research in population health.
Citations: 0
Advancing healthcare analytics: a thematic review of machine learning, health informatics, and real-world data applications
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-16 | DOI: 10.1016/j.jbi.2025.104934
Maria I. Arias, Lorena Cadavid, Juan D. Velásquez

Objective

To map the conceptual and methodological landscape of healthcare analytics by identifying dominant thematic clusters, synthesizing key trends, and outlining translational challenges and research opportunities in the field.

Methods

A total of 2,281 Scopus-indexed publications were analyzed using unsupervised text mining and clustering techniques. The analysis focused on identifying recurring themes, methodological innovations, and gaps within healthcare analytics literature across clinical, administrative, and public health contexts.
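A minimal sketch of the unsupervised text-mining and clustering step (illustrative only; the review's actual corpus, preprocessing, and clustering choices are not specified in the abstract, and the abstracts below are hypothetical stand-ins).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical stand-ins for Scopus-indexed abstracts.
abstracts = [
    "Deep learning for sepsis prediction in intensive care units.",
    "Federated learning preserves privacy across hospital networks.",
    "Social media mining for population mental health surveillance.",
    "Explainable AI for clinical decision support at the bedside.",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(abstracts)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = np.array(tfidf.get_feature_names_out())
for c in range(km.n_clusters):
    top = terms[np.argsort(km.cluster_centers_[c])[::-1][:4]]
    print(f"theme {c}: {', '.join(top)}")          # most characteristic terms per theme
```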

Results

Eight dominant themes were identified: intelligent systems for predictive healthcare, patient-centered health analytics, adaptive AI for clinical insights, demographic health analytics, digital mental health surveillance, ethical analytics for health surveillance, personalized care through data analytics, and AI-driven insights for outbreak response. These reflect a transition toward real-time, multimodal, and ethically grounded analytics ecosystems. Persistent challenges include data interoperability, algorithmic opacity, standardization of evaluation, and demographic bias.

Conclusions

The review highlights emerging priorities, including explainable AI, federated learning, and context-aware modeling, as well as ethical considerations related to data privacy and digital equity. Practical recommendations include co-designing with healthcare professionals, investing in infrastructure, and deploying real-time clinical decision support. Healthcare analytics is positioned as a foundational pillar of learning health systems with broad implications for translational research and precision health.
Citations: 0
Interpretable statistical modeling of patient flow in emergency departments
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-14 | DOI: 10.1016/j.jbi.2025.104937
Hugo Álvarez-Chaves, María D. R-Moreno

Objective:

This paper aims to develop a data-driven simulation framework for modeling patient flow in a hospital Emergency Department using interpretable methods throughout the entire process in the absence of system resource data. The goal is to improve understanding of system dynamics and support decision-making processes through transparent simulations, even when resource data are unavailable.

Methods:

We developed a simulation framework using anonymized medical records from a Spanish hospital’s Emergency Department. The model captures patient flow across triage levels by identifying routes and measuring the transition times between successive stages along each route. We estimated these transitions using both parametric (theoretical) distributions and non-parametric Kernel Density Estimation (KDE). Patient admission times are modeled using probability distributions. We enhanced realism through an iterative refinement process guided by tolerance thresholds and quantitative metrics, which refined the synthetic data to match the original distributions.
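A minimal sketch of the non-parametric transition-time modeling and an acceptance check in the spirit of the iterative refinement loop (illustrative only; the transition-time values are synthetic placeholders, not the hospital data, and the KS test is just one of the checks mentioned in the Results).

```python
import numpy as np
from scipy.stats import gaussian_kde, ks_2samp

rng = np.random.default_rng(42)

# Hypothetical observed transition times (minutes) between two ED stages.
observed = rng.gamma(shape=2.0, scale=15.0, size=500)

# Fit a KDE to the observed times and resample synthetic transitions from it.
kde = gaussian_kde(observed)
synthetic = kde.resample(500).ravel()
synthetic = np.clip(synthetic, 0, None)            # transition times cannot be negative

# Acceptance check: are synthetic and observed times statistically similar?
stat, p_value = ks_2samp(observed, synthetic)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
if p_value > 0.05:
    print("Synthetic transition times are statistically indistinguishable; accept.")
```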

Results:

Our approach produces highly realistic patient flow simulations with low tolerance values in the iterative method. The process gradually converges toward the original data. Distance and divergence metrics, together with statistical test results, indicate a high degree of similarity between the simulations and the real data, passing the Mann–Whitney U and Kolmogorov–Smirnov tests simultaneously in 100% of the generated samples when the tolerance threshold is low.

Conclusion:

The experimental results demonstrate that our simulation method effectively reproduces patient flow dynamics with a high level of realism and flexibility, even in the absence of information related to service resources. Its interpretable design and adjustable parameters enable safe data analysis and the exploration of alternative management strategies (e.g., modifying potential patient routes or restricting some transitions). These features position the methodology as a valuable tool for supporting informed decision-making and suggest its potential for use in other hospitals with suitable data, pending validation on external datasets.
Citations: 0
A comprehensive evaluation framework for synthetic medical tabular data generation
IF 4.5 | Medicine (CAS Tier 2) | Q2 Computer Science, Interdisciplinary Applications | Pub Date: 2025-10-14 | DOI: 10.1016/j.jbi.2025.104939
Anastasia Kurakova, Hajar Homayouni
Machine learning (ML) applications have enabled significant advancements in healthcare, such as predicting pandemics, personalizing treatments, and developing life-saving drugs. However, ML model training requires large datasets, which are difficult to obtain in healthcare due to privacy concerns. Synthetic data generation offers a promising solution by providing access to large-scale training data while protecting patient privacy. Our research focuses on tabular medical data, the predominant format for Electronic Health Records (EHRs), and introduces a comprehensive evaluation framework that assesses synthetic data in four critical dimensions: quality, privacy, usability, and computational complexity of the data generation process. The framework ensures that synthetic data maintains sufficient similarity to real data for ML applications while preserving patient confidentiality. To validate our approach, we applied six state-of-the-art (SOTA) generative models to generate synthetic medical datasets and evaluated them within our framework. In contrast to conventional approaches that focus primarily on statistical similarity, our framework provides a broader assessment that incorporates outlier detection, privacy risks, and domain-specific constraints. Our findings demonstrate that our framework can identify critical shortcomings in synthetic data generation models, such as the amplification of duplicate rows and the generation of out-of-range values, which are overlooked by traditional statistical evaluation methods. Our implementation of the framework is available at: https://github.com/akurakova/SDE_Framework
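Two of the shortcomings highlighted above, duplicate-row amplification and out-of-range values, can be checked roughly as follows; the columns, values, and clinical bounds are hypothetical, and this is not the framework's actual code (which is available at the repository linked above).

```python
import pandas as pd

real = pd.DataFrame({"age": [34, 51, 51, 78], "sbp": [118, 135, 135, 160]})
synth = pd.DataFrame({"age": [51, 51, 51, 290], "sbp": [135, 135, 135, 80]})

# Duplicate-row amplification: does the generator over-copy repeated records?
dup_real = real.duplicated().mean()
dup_synth = synth.duplicated().mean()
print(f"duplicate fraction real={dup_real:.2f} synthetic={dup_synth:.2f}")
if dup_synth > dup_real:
    print("Warning: duplicate rows amplified in the synthetic data.")

# Out-of-range values: enforce domain constraints per column.
ranges = {"age": (0, 120), "sbp": (50, 250)}       # hypothetical clinical bounds
for col, (lo, hi) in ranges.items():
    bad = (~synth[col].between(lo, hi)).sum()
    if bad:
        print(f"Warning: {bad} out-of-range value(s) in column '{col}'.")
```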
Citations: 0