首页 > 最新文献

Journal of Biomedical Informatics最新文献

英文 中文
Scaling up biomedical vision-language models: Fine-tuning, instruction tuning, and multi-modal learning 扩大生物医学视觉语言模型:微调,指令调整和多模态学习。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-10-23 DOI: 10.1016/j.jbi.2025.104946
Cheng Peng , Kai Zhang , Mengxian Lyu , Hongfang Liu , Lichao Sun , Yonghui Wu

Objective

To advance biomedical vision language model capabilities through scaling up, fine-tuning, and instruction tuning, develop vision-language models with improved performance in handling long text, explore strategies to efficiently adopt vision language models for diverse multi-modal biomedical tasks, and examine the zero-shot learning performance.

Methods

We developed two biomedical vision language models, BiomedGPT-Large and BiomedGPT-XLarge, based on an encoder-decoder-based transformer architecture. We fine-tuned the two models on 23 benchmark datasets from 6 multi-modal biomedical tasks, including one image-only task (image classification), three language-only tasks (text understanding, text summarization, and question answering), and two vision-language tasks (visual question answering and image captioning). We compared the developed scaled models with our previous BiomedGPT-Base model and existing prestigious models reported in the literature. We instruction-tuned the two models using a large-scale multi-modal biomedical instruction-tuning dataset and assessed the zero-shot learning performance and alignment accuracy.

Results and Conclusion

The experimental results show that the new models developed in this study outperform our previous BiomedGPT-Base model on 17 of 23 benchmark datasets and achieve state-of-the-art performance on 15 of 23 datasets when compared to previous models reported in the literature. The new models also demonstrated improved ability in handling long text, particularly on text summarization on the MIMIC-III dataset and text understanding on the SEER dataset, with a remarkable improvement of 4.6–11.4 %. Instruction tuning on the scaled models resulted in significant enhancements in zero-shot learning ability and alignment accuracy in following complex instructions across multiple tasks, including image classification, visual question answering, and image captioning. This study develops two vision-language models in the biomedical domain and examines technologies to improve long text content in vision language models through scaling, fine-tuning, and instruction tuning. This study demonstrates the potential of vision language models to integrate multiple data modalities to solve diverse multimodal tasks in the biomedical domain.
目的:通过放大、微调和指令调优来提升生物医学视觉语言模型的能力,开发具有较好长文本处理性能的视觉语言模型,探索将视觉语言模型高效应用于多种多模态生物医学任务的策略,并检验其零采样学习性能。方法:基于基于编码器-解码器的转换器架构,我们开发了两个生物医学视觉语言模型:BiomedGPT-Large和BiomedGPT-XLarge。我们在来自6个多模态生物医学任务的23个基准数据集上对这两个模型进行了微调,这些任务包括一个纯图像任务(图像分类)、三个纯语言任务(文本理解、文本摘要和问答)和两个视觉语言任务(视觉问答和图像字幕)。我们将开发的缩放模型与我们之前的生物gpt基础模型和文献中报道的现有著名模型进行了比较。我们使用大规模的多模态生物医学指令调整数据集对这两个模型进行了指令调整,并评估了零射击学习性能和对准精度。结果和结论:实验结果表明,与文献中报道的模型相比,本研究开发的新模型在23个基准数据集中的17个上优于我们之前的生物gpt - base模型,在23个数据集中的15个上达到了最先进的性能。新模型在处理长文本方面的能力也有所提高,特别是在MIMIC-III数据集上的文本摘要和SEER数据集上的文本理解方面,显著提高了4.6-11.4 %。缩放模型上的指令调整显著增强了零射击学习能力和跨多个任务执行复杂指令的对齐精度,包括图像分类、视觉问题回答和图像字幕。本研究开发了两种生物医学领域的视觉语言模型,并研究了通过缩放、微调和指令调整来改善视觉语言模型中长文本内容的技术。本研究展示了视觉语言模型集成多种数据模态以解决生物医学领域多种多模态任务的潜力。
{"title":"Scaling up biomedical vision-language models: Fine-tuning, instruction tuning, and multi-modal learning","authors":"Cheng Peng ,&nbsp;Kai Zhang ,&nbsp;Mengxian Lyu ,&nbsp;Hongfang Liu ,&nbsp;Lichao Sun ,&nbsp;Yonghui Wu","doi":"10.1016/j.jbi.2025.104946","DOIUrl":"10.1016/j.jbi.2025.104946","url":null,"abstract":"<div><h3>Objective</h3><div>To advance biomedical vision language model capabilities through scaling up, fine-tuning, and instruction tuning, develop vision-language models with improved performance in handling long text, explore strategies to efficiently adopt vision language models for diverse multi-modal biomedical tasks, and examine the zero-shot learning performance.</div></div><div><h3>Methods</h3><div>We developed two biomedical vision language models, BiomedGPT-Large and BiomedGPT-XLarge, based on an encoder-decoder-based transformer architecture. We fine-tuned the two models on 23 benchmark datasets from 6 multi-modal biomedical tasks, including one image-only task (image classification), three language-only tasks (text understanding, text summarization, and question answering), and two vision-language tasks (visual question answering and image captioning). We compared the developed scaled models with our previous BiomedGPT-Base model and existing prestigious models reported in the literature. We instruction-tuned the two models using a large-scale multi-modal biomedical instruction-tuning dataset and assessed the zero-shot learning performance and alignment accuracy.</div></div><div><h3>Results and Conclusion</h3><div>The experimental results show that the new models developed in this study outperform our previous BiomedGPT-Base model on 17 of 23 benchmark datasets and achieve state-of-the-art performance on 15 of 23 datasets when compared to previous models reported in the literature. The new models also demonstrated improved ability in handling long text, particularly on text summarization on the MIMIC-III dataset and text understanding on the SEER dataset, with a remarkable improvement of 4.6–11.4 %. Instruction tuning on the scaled models resulted in significant enhancements in zero-shot learning ability and alignment accuracy in following complex instructions across multiple tasks, including image classification, visual question answering, and image captioning. This study develops two vision-language models in the biomedical domain and examines technologies to improve long text content in vision language models through scaling, fine-tuning, and instruction tuning. This study demonstrates the potential of vision language models to integrate multiple data modalities to solve diverse multimodal tasks in the biomedical domain.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104946"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145370338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Interpretable statistical modeling of patient flow in emergency departments 急诊科病人流动的可解释统计模型。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-10-14 DOI: 10.1016/j.jbi.2025.104937
Hugo Álvarez-Chaves, María D. R-Moreno

Objective:

This paper aims to develop a data-driven simulation framework for modeling patient flow in a hospital Emergency Department using interpretable methods throughout the entire process in the absence of system resource data. The goal is to improve understanding of system dynamics and support decision-making processes through transparent simulations, even when resource data are unavailable.

Methods:

We developed a simulation framework using anonymized medical records from a Spanish hospital’s Emergency Department. The model captures patient flow considering triage levels by identifying routes and measuring the transition times between each stage in them. We estimated these transitions using both parametric (theoretical) distributions and non-parametric Kernel Density Estimation (KDE). Patient admissions times are modeled by using probability distributions. We enhanced realism through an iterative refinement process guided by tolerance thresholds and quantitative metrics. This process refined the synthetic data to match the original distributions.

Results:

Our approach produces highly realistic patient flow simulations with low tolerance values in the iterative method. The process gradually converges toward the original data. Distance and divergence metrics, together with statistical test results, indicate a high degree of similarity between the simulations and the real data, passing the Mann–Whitney U and Kolmogorov–Smirnov tests simultaneously in 100% of the generated samples when the tolerance threshold is low.

Conclusion:

The experimental results demonstrate that our simulation method effectively reproduces patient flow dynamics with a high level of realism and flexibility, even in the absence of information related to service resources. Its interpretable design and adjustable parameters enable safe data analysis and the exploration of alternative management strategies (e.g., modifying potential patient routes or restricting some transitions). These features position the methodology as a valuable tool for supporting informed decision-making and suggest its potential for use in other hospitals with suitable data, pending validation on external datasets.
目的:本文旨在开发一个数据驱动的仿真框架,在缺乏系统资源数据的情况下,使用可解释的方法对医院急诊科的整个流程进行建模。目标是提高对系统动力学的理解,并通过透明的模拟支持决策过程,即使在资源数据不可用的情况下也是如此。方法:我们利用西班牙一家医院急诊科的匿名医疗记录开发了一个模拟框架。该模型通过识别路线和测量每个阶段之间的过渡时间来捕获考虑分诊级别的患者流量。我们使用参数(理论)分布和非参数核密度估计(KDE)来估计这些过渡。病人入院时间用概率分布建模。我们通过由容忍阈值和定量度量指导的迭代改进过程增强了现实性。该过程将合成数据细化到与原始分布相匹配。结果:我们的方法在迭代方法中产生了具有低容差值的高度逼真的患者流模拟。这个过程逐渐向原始数据收敛。距离和散度指标以及统计测试结果表明,模拟与实际数据高度相似,当容差阈值较低时,100%的生成样本同时通过了Mann-Whitney U和Kolmogorov-Smirnov测试。结论:实验结果表明,即使在缺乏与服务资源相关的信息的情况下,我们的模拟方法也能有效地再现患者流动动力学,具有很高的真实感和灵活性。其可解释的设计和可调整的参数使安全的数据分析和探索替代管理策略(例如,修改潜在的病人路线或限制一些过渡)。这些特点使该方法成为支持知情决策的宝贵工具,并表明其在其他具有适当数据的医院中使用的潜力,有待外部数据集的验证。
{"title":"Interpretable statistical modeling of patient flow in emergency departments","authors":"Hugo Álvarez-Chaves,&nbsp;María D. R-Moreno","doi":"10.1016/j.jbi.2025.104937","DOIUrl":"10.1016/j.jbi.2025.104937","url":null,"abstract":"<div><h3>Objective:</h3><div>This paper aims to develop a data-driven simulation framework for modeling patient flow in a hospital Emergency Department using interpretable methods throughout the entire process in the absence of system resource data. The goal is to improve understanding of system dynamics and support decision-making processes through transparent simulations, even when resource data are unavailable.</div></div><div><h3>Methods:</h3><div>We developed a simulation framework using anonymized medical records from a Spanish hospital’s Emergency Department. The model captures patient flow considering triage levels by identifying routes and measuring the transition times between each stage in them. We estimated these transitions using both parametric (theoretical) distributions and non-parametric Kernel Density Estimation (KDE). Patient admissions times are modeled by using probability distributions. We enhanced realism through an iterative refinement process guided by tolerance thresholds and quantitative metrics. This process refined the synthetic data to match the original distributions.</div></div><div><h3>Results:</h3><div>Our approach produces highly realistic patient flow simulations with low tolerance values in the iterative method. The process gradually converges toward the original data. Distance and divergence metrics, together with statistical test results, indicate a high degree of similarity between the simulations and the real data, passing the Mann–Whitney U and Kolmogorov–Smirnov tests simultaneously in 100% of the generated samples when the tolerance threshold is low.</div></div><div><h3>Conclusion:</h3><div>The experimental results demonstrate that our simulation method effectively reproduces patient flow dynamics with a high level of realism and flexibility, even in the absence of information related to service resources. Its interpretable design and adjustable parameters enable safe data analysis and the exploration of alternative management strategies (e.g., modifying potential patient routes or restricting some transitions). These features position the methodology as a valuable tool for supporting informed decision-making and suggest its potential for use in other hospitals with suitable data, pending validation on external datasets.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104937"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145308156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Discovering signature disease trajectories in pancreatic cancer and soft-tissue sarcoma from longitudinal patient records 从纵向患者记录中发现胰腺癌和软组织肉瘤的标志性疾病轨迹。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-10-19 DOI: 10.1016/j.jbi.2025.104935
Liwei Wang , Rui Li , Andrew Wen , Qiuhao Lu , Jinlian Wang , Xiaoyang Ruan , Adriana Gamboa , Neha Malik , Christina L. Roland , Matthew H.G. Katz , Heather Lyu , Hongfang Liu

Background

Most clinicians have limited experience with rare diseases, making diagnosis and treatment challenging. Large real-world data sources, such as electronic health records (EHRs), provide a massive amount of information that can potentially be leveraged to determine the patterns of diagnoses and treatments for rare tumors that can serve as clinical decision aids.

Objectives

We aimed to discover signature disease trajectories of 3 rare cancer types: pancreatic cancer, STS of the trunk and extremity (STS-TE), and STS of the abdomen and retroperitoneum (STS-AR).

Materials and Methods

Leveraging IQVIA Oncology Electronic Medical Record, we identified significant diagnosis pairs across 3 years in patients with these cancers through matched cohort sampling, statistical computation, right-tailed binomial hypothesis test, and then visualized trajectories up to 3 progressions. We further conducted systematic validation for the discovered trajectories with the UTHealth Electronic Health Records (EHR).

Results

Results included 266 significant diagnosis pairs for pancreatic cancer, 130 for STS-TE, and 118 for STS-AR. We further found 44 2-hop (i.e., 2-progression) and 136 3-hop trajectories before pancreatic cancer, 36 2-hop and 37 3-hop trajectories before STS-TE, and 17 2-hop and 5 3-hop trajectories before STS-AR. Meanwhile, we found 54 2-hop and 129 3-hop trajectories following pancreatic cancer, 11 2-hop and 17 3-hop trajectories following STS-TE, 5 2-hop and 0 3-hop trajectories following STS-AR. For example, pain in joint and gastro-oesophageal reflux disease occurred before pancreatic cancer in 64 (0.5%) patients, pain in joint and “pain in limb, hand, foot, fingers and toes” occurred before STS-TE in 40 (0.9%) patients, agranulocytosis secondary to cancer chemotherapy and neoplasm related pain occurred after pancreatic cancer in 256 (1.9%) patients. Systematic validation using the UTHealth EHR confirmed the validity of the discovered trajectories.

Conclusion

We identified signature disease trajectories for the studied rare cancers by leveraging large-scale EHR data and trajectory mining approaches. These disease trajectories could serve as potential resources for clinicians to deepen their understanding of the temporal progression of conditions preceding and following these rare cancers, further informing patient-care decisions.
背景:大多数临床医生对罕见病的经验有限,使得诊断和治疗具有挑战性。电子健康记录(EHRs)等大型真实世界数据源提供了大量信息,可用于确定罕见肿瘤的诊断和治疗模式,从而作为临床决策辅助工具。目的:研究胰腺癌、躯干及四肢STS (STS- te)和腹部及腹膜后STS (STS- ar) 3种罕见肿瘤的特征发病轨迹。材料和方法:利用IQVIA肿瘤电子病历,我们通过匹配队列抽样、统计计算、右尾二项假设检验,确定了这些癌症患者在3 年内的显著诊断对,然后可视化了3个进展的轨迹。我们进一步用UTHealth电子健康记录(EHR)对发现的轨迹进行了系统验证。结果:结果包括266对胰腺癌,130对STS-TE, 118对STS-AR的显著诊断。我们进一步发现胰腺癌前44个2-跳(即2-进展)和136个3-跳轨迹,STS-TE前36个2-跳和37个3-跳轨迹,STS-AR前17个2-跳和5个3-跳轨迹。同时,我们发现胰腺癌后有54个2-跳和129个3-跳轨迹,STS-TE后有11个2-跳和17个3-跳轨迹,STS-AR后有5个2-跳和0个3-跳轨迹。例如,64例(0.5%)患者在胰腺癌前出现关节痛和胃食管反流病,40例(0.9%)患者在STS-TE前出现关节痛和“四肢、手、脚、手指和脚趾痛”,256例(1.9%)患者在胰腺癌后出现癌症化疗后继发粒细胞缺乏症和肿瘤相关疼痛。使用UTHealth电子病历系统验证了所发现轨迹的有效性。结论:通过利用大规模电子病历数据和轨迹挖掘方法,我们确定了所研究的罕见癌症的标志性疾病轨迹。这些疾病轨迹可以作为临床医生的潜在资源,加深他们对这些罕见癌症之前和之后病情的时间进展的理解,进一步为患者护理决策提供信息。
{"title":"Discovering signature disease trajectories in pancreatic cancer and soft-tissue sarcoma from longitudinal patient records","authors":"Liwei Wang ,&nbsp;Rui Li ,&nbsp;Andrew Wen ,&nbsp;Qiuhao Lu ,&nbsp;Jinlian Wang ,&nbsp;Xiaoyang Ruan ,&nbsp;Adriana Gamboa ,&nbsp;Neha Malik ,&nbsp;Christina L. Roland ,&nbsp;Matthew H.G. Katz ,&nbsp;Heather Lyu ,&nbsp;Hongfang Liu","doi":"10.1016/j.jbi.2025.104935","DOIUrl":"10.1016/j.jbi.2025.104935","url":null,"abstract":"<div><h3>Background</h3><div>Most clinicians have limited experience with rare diseases, making diagnosis and treatment challenging. Large real-world data sources, such as electronic health records (EHRs), provide a massive amount of information that can potentially be leveraged to determine the patterns of diagnoses and treatments for rare tumors that can serve as clinical decision aids.</div></div><div><h3>Objectives</h3><div>We aimed to discover signature disease trajectories of 3 rare cancer types: pancreatic cancer, STS of the trunk and extremity (STS-TE), and STS of the abdomen and retroperitoneum (STS-AR).</div></div><div><h3>Materials and Methods</h3><div>Leveraging IQVIA Oncology Electronic Medical Record, we identified significant diagnosis pairs across 3 years in patients with these cancers through matched cohort sampling, statistical computation, right-tailed binomial hypothesis test, and then visualized trajectories up to 3 progressions. We further conducted systematic validation for the discovered trajectories with the UTHealth Electronic Health Records (EHR).</div></div><div><h3>Results</h3><div>Results included 266 significant diagnosis pairs for pancreatic cancer, 130 for STS-TE, and 118 for STS-AR. We further found 44 2-hop (i.e., 2-progression) and 136 3-hop trajectories before pancreatic cancer, 36 2-hop and 37 3-hop trajectories before STS-TE, and 17 2-hop and 5 3-hop trajectories before STS-AR. Meanwhile, we found 54 2-hop and 129 3-hop trajectories following pancreatic cancer, 11 2-hop and 17 3-hop trajectories following STS-TE, 5 2-hop and 0 3-hop trajectories following STS-AR. For example, pain in joint and gastro-oesophageal reflux disease occurred before pancreatic cancer in 64 (0.5%) patients, pain in joint and “pain in limb, hand, foot, fingers and toes” occurred before STS-TE in 40 (0.9%) patients, agranulocytosis secondary to cancer chemotherapy and neoplasm related pain occurred after pancreatic cancer in 256 (1.9%) patients. Systematic validation using the UTHealth EHR confirmed the validity of the discovered trajectories.</div></div><div><h3>Conclusion</h3><div>We identified signature disease trajectories for the studied rare cancers by leveraging large-scale EHR data and trajectory mining approaches. These disease trajectories could serve as potential resources for clinicians to deepen their understanding of the temporal progression of conditions preceding and following these rare cancers, further informing patient-care decisions.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104935"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145344968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Advancing healthcare analytics: a thematic review of machine learning, health informatics, and real-world data applications 推进医疗保健分析:机器学习,健康信息学和现实世界数据应用的专题审查。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-10-16 DOI: 10.1016/j.jbi.2025.104934
Maria I. Arias, Lorena Cadavid, Juan D. Velásquez

Objective

To map the conceptual and methodological landscape of healthcare analytics by identifying dominant thematic clusters, synthesizing key trends, and outlining translational challenges and research opportunities in the field.

Methods

A total of 2,281 Scopus-indexed publications were analyzed using unsupervised text mining and clustering techniques. The analysis focused on identifying recurring themes, methodological innovations, and gaps within healthcare analytics literature across clinical, administrative, and public health contexts.

Results

Eight dominant themes were identified: intelligent systems for predictive healthcare, patient-centered health analytics, adaptive AI for clinical insights, demographic health analytics, digital mental health surveillance, ethical analytics for health surveillance, personalized care through data analytics, and AI-driven insights for outbreak response. These reflect a transition toward real-time, multimodal, and ethically grounded analytics ecosystems. Persistent challenges include data interoperability, algorithmic opacity, standardization of evaluation, and demographic bias.

Conclusions

The review highlights emerging priorities, including explainable AI, federated learning, and context-aware modeling, as well as ethical considerations related to data privacy and digital equity. Practical recommendations include co-designing with healthcare professionals, investing in infrastructure, and deploying real-time clinical decision support. Healthcare analytics is positioned as a foundational pillar of learning health systems with broad implications for translational research and precision health.
目的:通过确定主要的专题集群,综合关键趋势,概述该领域的转化挑战和研究机会,绘制医疗保健分析的概念和方法景观。方法:采用无监督文本挖掘和聚类技术对共2,281篇scopus索引的出版物进行分析。分析的重点是确定临床、行政和公共卫生背景下医疗分析文献中反复出现的主题、方法创新和差距。结果:确定了八个主要主题:预测性医疗保健的智能系统、以患者为中心的健康分析、用于临床见解的自适应人工智能、人口健康分析、数字精神健康监测、用于健康监测的伦理分析、通过数据分析进行个性化护理,以及用于疫情应对的人工智能驱动的见解。这些反映了向实时、多模式和基于道德的分析生态系统的转变。持续存在的挑战包括数据互操作性、算法不透明、评估标准化和人口统计偏差。结论:该综述强调了新兴的优先事项,包括可解释的人工智能、联邦学习和情境感知建模,以及与数据隐私和数字公平相关的道德考虑。实用建议包括与医疗保健专业人员共同设计、投资基础设施以及部署实时临床决策支持。医疗保健分析被定位为学习卫生系统的基础支柱,对转化研究和精确健康具有广泛的影响。
{"title":"Advancing healthcare analytics: a thematic review of machine learning, health informatics, and real-world data applications","authors":"Maria I. Arias,&nbsp;Lorena Cadavid,&nbsp;Juan D. Velásquez","doi":"10.1016/j.jbi.2025.104934","DOIUrl":"10.1016/j.jbi.2025.104934","url":null,"abstract":"<div><h3>Objective</h3><div>To map the conceptual and methodological landscape of healthcare analytics by identifying dominant thematic clusters, synthesizing key trends, and outlining translational challenges and research opportunities in the field.</div></div><div><h3>Methods</h3><div>A total of 2,281 Scopus-indexed publications were analyzed using unsupervised text mining and clustering techniques. The analysis focused on identifying recurring themes, methodological innovations, and gaps within healthcare analytics literature across clinical, administrative, and public health contexts.</div></div><div><h3>Results</h3><div>Eight dominant themes were identified: intelligent systems for predictive healthcare, patient-centered health analytics, adaptive AI for clinical insights, demographic health analytics, digital mental health surveillance, ethical analytics for health surveillance, personalized care through data analytics, and AI-driven insights for outbreak response. These reflect a transition toward real-time, multimodal, and ethically grounded analytics ecosystems. Persistent challenges include data interoperability, algorithmic opacity, standardization of evaluation, and demographic bias.</div></div><div><h3>Conclusions</h3><div>The review highlights emerging priorities, including explainable AI, federated learning, and context-aware modeling, as well as ethical considerations related to data privacy and digital equity. Practical recommendations include co-designing with healthcare professionals, investing in infrastructure, and deploying real-time clinical decision support. Healthcare analytics is positioned as a foundational pillar of learning health systems with broad implications for translational research and precision health.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104934"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145318166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
GraphFusion: Integrative prediction of drug synergy using multi-scale graph representations and cell line contexts GraphFusion:使用多尺度图形表示和细胞系上下文对药物协同作用进行综合预测。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-09-30 DOI: 10.1016/j.jbi.2025.104921
Biyang Zeng, Shikui Tu, Lei Xu
Predicting the synergy of drug combinations is crucial for cancer treatment and drug development. Accurate prediction requires the integration of multiple types of data, including molecular structures of individual drugs, available synergy scores between drugs, and gene expression information from different cancer cell lines. The first two types contain multi-scale information within or between drugs, while the cell lines serve as the contextual background for drug interactions. Existing machine learning methods fail to fully utilize and integrate these information, leading to suboptimal performance. To address this issue, we introduce GraphFusion, an innovative approach that combines molecular graphs and drug synergy graphs with cell line contextual information. By employing novel GCN and Graphormer modules capable of accepting and utilizing external information, GraphFusion integrates these two levels of graph information. Specifically, the molecular graphs pass fine-grained structural information to the synergy graphs, while the synergy graphs convey global drug interaction data to the molecular graphs. Additionally, cell line information is incorporated as contextual background. This comprehensive integration enables GraphFusion to achieve state-of-the-art results on the O’Neil and NCI-ALMANAC datasets.
预测药物组合的协同作用对癌症治疗和药物开发至关重要。准确的预测需要整合多种类型的数据,包括单个药物的分子结构,药物之间可用的协同评分,以及来自不同癌细胞系的基因表达信息。前两种类型包含药物内部或药物之间的多尺度信息,而细胞系则作为药物相互作用的背景。现有的机器学习方法不能充分利用和整合这些信息,导致性能不佳。为了解决这个问题,我们引入了GraphFusion,这是一种将分子图和药物协同作用图与细胞系上下文信息相结合的创新方法。GraphFusion采用能够接受和利用外部信息的新型GCN和graphhormer模块,将这两个层次的图形信息整合在一起。具体来说,分子图将细粒度的结构信息传递给协同图,而协同图将全局药物相互作用数据传递给分子图。此外,细胞系信息被纳入上下文背景。这种全面的集成使GraphFusion能够在O'Neil和NCI-ALMANAC数据集上实现最先进的结果。
{"title":"GraphFusion: Integrative prediction of drug synergy using multi-scale graph representations and cell line contexts","authors":"Biyang Zeng,&nbsp;Shikui Tu,&nbsp;Lei Xu","doi":"10.1016/j.jbi.2025.104921","DOIUrl":"10.1016/j.jbi.2025.104921","url":null,"abstract":"<div><div>Predicting the synergy of drug combinations is crucial for cancer treatment and drug development. Accurate prediction requires the integration of multiple types of data, including molecular structures of individual drugs, available synergy scores between drugs, and gene expression information from different cancer cell lines. The first two types contain multi-scale information within or between drugs, while the cell lines serve as the contextual background for drug interactions. Existing machine learning methods fail to fully utilize and integrate these information, leading to suboptimal performance. To address this issue, we introduce GraphFusion, an innovative approach that combines molecular graphs and drug synergy graphs with cell line contextual information. By employing novel GCN and Graphormer modules capable of accepting and utilizing external information, GraphFusion integrates these two levels of graph information. Specifically, the molecular graphs pass fine-grained structural information to the synergy graphs, while the synergy graphs convey global drug interaction data to the molecular graphs. Additionally, cell line information is incorporated as contextual background. This comprehensive integration enables GraphFusion to achieve state-of-the-art results on the O’Neil and NCI-ALMANAC datasets.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104921"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145212771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Cross-scale semantic fusion integration of dual pathway models in drug repositioning 药物重新定位中双通路模型的跨尺度语义融合整合。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-09-25 DOI: 10.1016/j.jbi.2025.104914
Mingxuan Li, Shuai Li, Zhen Li, Mandong Hu
Drug Repositioning (DR) represents an innovative drug development strategy that significantly reduces both cost and time by identifying new therapeutic indications for approved drugs. Current methods primarily focus on extracting information from drug–disease networks, but often overlook critical local structural details between nodes. This study introduces CSDPDR, a novel Dual-branch graph neural network that integrates Topology Feature Information and Salient Feature Information to enhance drug repositioning accuracy and efficiency. Through the Topology-aware branch with Adaptive Residual Graph Attention and the Saliency-aware branch with Score-Driven Top-K Convolutional Graph Pooling, the model can capture both large-scale topology patterns and fine-grained local information. Furthermore, our approach effectively alleviate graph sparsity issues through meta-path-based network enhancement and confidence-based filtering mechanisms. Comparative experiments on two benchmark datasets an additional dataset demonstrate that CSDPDR significantly outperforms several state-of-the-art baseline methods. Case studies on Alzheimer’s disease and breast neoplasms further validate the model’s practical applicability and effectiveness.
药物重新定位(DR)是一种创新的药物开发策略,通过为已批准的药物确定新的治疗适应症,显著降低成本和时间。目前的方法主要集中于从药物-疾病网络中提取信息,但往往忽略了节点之间关键的局部结构细节。本研究引入了一种新的双分支图神经网络CSDPDR,该网络将拓扑特征信息和显著特征信息相结合,以提高药物重定位的准确性和效率。通过自适应残差图注意的拓扑感知分支和分数驱动的Top-K卷积图池的显著性感知分支,该模型既能捕获大规模的拓扑模式,又能捕获细粒度的局部信息。此外,我们的方法通过基于元路径的网络增强和基于置信度的过滤机制有效地缓解了图稀疏性问题。在两个基准数据集和另一个数据集上的对比实验表明,CSDPDR显著优于几种最先进的基线方法。阿尔茨海默病和乳腺肿瘤的案例研究进一步验证了该模型的实用性和有效性。
{"title":"Cross-scale semantic fusion integration of dual pathway models in drug repositioning","authors":"Mingxuan Li,&nbsp;Shuai Li,&nbsp;Zhen Li,&nbsp;Mandong Hu","doi":"10.1016/j.jbi.2025.104914","DOIUrl":"10.1016/j.jbi.2025.104914","url":null,"abstract":"<div><div>Drug Repositioning (DR) represents an innovative drug development strategy that significantly reduces both cost and time by identifying new therapeutic indications for approved drugs. Current methods primarily focus on extracting information from drug–disease networks, but often overlook critical local structural details between nodes. This study introduces CSDPDR, a novel Dual-branch graph neural network that integrates Topology Feature Information and Salient Feature Information to enhance drug repositioning accuracy and efficiency. Through the Topology-aware branch with Adaptive Residual Graph Attention and the Saliency-aware branch with Score-Driven Top-K Convolutional Graph Pooling, the model can capture both large-scale topology patterns and fine-grained local information. Furthermore, our approach effectively alleviate graph sparsity issues through meta-path-based network enhancement and confidence-based filtering mechanisms. Comparative experiments on two benchmark datasets an additional dataset demonstrate that CSDPDR significantly outperforms several state-of-the-art baseline methods. Case studies on Alzheimer’s disease and breast neoplasms further validate the model’s practical applicability and effectiveness.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104914"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145182173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
LFVDNet: Low-frequency variable-driven network for medical time series LFVDNet:医疗时间序列的低频变量驱动网络。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-09-23 DOI: 10.1016/j.jbi.2025.104913
Yue Zhang , Dengqun Sun , Lei Li , Jian Zhou , Xiuquan Du , Shuo Li

Objective:

Medical time series, a type of multivariate time series with missing values, is widely used to predict time series analysis, the “impute first, then predict” end-to-end architecture is used to address this issue. However, existing methods are likely to lead to the loss of uniqueness and key information of low-frequency sampled variables (LFSVs) when dealing with them. In this paper, we aim to develop a method that effectively handles LFSVs, preserving their distinctive characteristics and essential information throughout the modeling process.

Methods:

We propose a novel end-to-end method named Low-Frequency Variable-Driven network (LFVDNet) for medical time series analysis. Specifically, the Time-Aware Imputer (TA) module encodes the observed values and critical time information, and uses the attention mechanism to establish an association between the observed values and the missing values. TA adopts channel-independent strategy to prevent interference from high-frequency sampled variables (HFSVs) on LFSVs, thereby preserving the unique information contained in LFSVs. The Offset-Selection Module (OS) independently selects data points for each variable through offsets, avoiding the natural disadvantages of LFSVs in selection-based imputation, thus solving the problem of the loss of key information of LFSVs. LFVDNet is the first method for analyzing multivariate time series with missing values that emphasizes the effective utilization of LFSVs.

Results:

We carried out the experiments on four public datasets and the experimental results indicate that LFVDNet has better robustness and performance. All code is available at https://github.com/dxqllp/LFVDNet.

Conclusions:

This study proposes a novel method for medical time series analysis, namely LFVDNet, which aims to effectively utilize LFSVs. Specifically, we have designed the TA module, which performs imputation through temporal correlations. The OS module, on the other hand, performs selective imputation based on a data point selection strategy. We have verified the effectiveness of this method on four datasets constructed from PhysioNet 2012 and MIMIC-IV.
目的:医学时间序列作为一种多变量时间序列的缺失值预测被广泛应用于时间序列分析,采用“先估算后预测”的端到端架构解决这一问题。然而,现有的方法在处理低频采样变量(LFSVs)时,容易导致其唯一性和关键信息的丢失。在本文中,我们的目标是开发一种有效处理LFSVs的方法,在整个建模过程中保留其独特的特征和基本信息。方法:提出一种新的端到端医学时间序列分析方法——低频变量驱动网络(LFVDNet)。具体来说,TA (time - aware Imputer)模块对观测值和关键时间信息进行编码,并利用注意机制在观测值和缺失值之间建立关联。TA采用信道无关策略,防止高频采样变量(HFSVs)对LFSVs的干扰,从而保留了LFSVs中所包含的唯一信息。偏移选择模块(Offset-Selection Module, OS)通过偏移量独立选择每个变量的数据点,避免了LFSVs在基于选择的插值中固有的缺点,从而解决了LFSVs关键信息丢失的问题。LFVDNet是第一个强调lfsv有效利用的多变量缺失值时间序列分析方法。结果:我们在四个公共数据集上进行了实验,实验结果表明LFVDNet具有更好的鲁棒性和性能。本文提出了一种新的医学时间序列分析方法,即LFVDNet,旨在有效地利用lfsv。具体而言,我们设计了TA模块,该模块通过时间相关性进行imputation。另一方面,操作系统模块根据数据点选择策略执行选择性插补。我们在PhysioNet 2012和MIMIC-IV构建的四个数据集上验证了该方法的有效性。
{"title":"LFVDNet: Low-frequency variable-driven network for medical time series","authors":"Yue Zhang ,&nbsp;Dengqun Sun ,&nbsp;Lei Li ,&nbsp;Jian Zhou ,&nbsp;Xiuquan Du ,&nbsp;Shuo Li","doi":"10.1016/j.jbi.2025.104913","DOIUrl":"10.1016/j.jbi.2025.104913","url":null,"abstract":"<div><h3>Objective:</h3><div>Medical time series, a type of multivariate time series with missing values, is widely used to predict time series analysis, the “impute first, then predict” end-to-end architecture is used to address this issue. However, existing methods are likely to lead to the loss of uniqueness and key information of low-frequency sampled variables (LFSVs) when dealing with them. In this paper, we aim to develop a method that effectively handles LFSVs, preserving their distinctive characteristics and essential information throughout the modeling process.</div></div><div><h3>Methods:</h3><div>We propose a novel end-to-end method named <em><strong>L</strong>ow-<strong>F</strong>requency <strong>V</strong>ariable-<strong>D</strong>riven network</em> (LFVDNet) for medical time series analysis. Specifically, the Time-Aware Imputer (TA) module encodes the observed values and critical time information, and uses the attention mechanism to establish an association between the observed values and the missing values. TA adopts channel-independent strategy to prevent interference from high-frequency sampled variables (HFSVs) on LFSVs, thereby preserving the unique information contained in LFSVs. The Offset-Selection Module (OS) independently selects data points for each variable through offsets, avoiding the natural disadvantages of LFSVs in selection-based imputation, thus solving the problem of the loss of key information of LFSVs. LFVDNet is the first method for analyzing multivariate time series with missing values that emphasizes the effective utilization of LFSVs.</div></div><div><h3>Results:</h3><div>We carried out the experiments on four public datasets and the experimental results indicate that LFVDNet has better robustness and performance. All code is available at <span><span>https://github.com/dxqllp/LFVDNet</span><svg><path></path></svg></span>.</div></div><div><h3>Conclusions:</h3><div>This study proposes a novel method for medical time series analysis, namely LFVDNet, which aims to effectively utilize LFSVs. Specifically, we have designed the TA module, which performs imputation through temporal correlations. The OS module, on the other hand, performs selective imputation based on a data point selection strategy. We have verified the effectiveness of this method on four datasets constructed from PhysioNet 2012 and MIMIC-IV.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104913"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145149181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Towards a Biological Evaluation Framework for Oversampling (BEFO) gene expression data 构建过采样(BEFO)基因表达数据生物学评价框架。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-10-17 DOI: 10.1016/j.jbi.2025.104932
Kevin Fee , Suneil Jain , Ross G. Murphy , Anna Jurek-Loughrey
Machine learning (ML) techniques are progressively being used in biomedical research to improve diagnostic and prognostic accuracy when used in conjunction with a clinician as a decision support system. However, many datasets used in biomedical research often suffer from severe class imbalance due to small population sizes, which causes machine learning models to become biased to majority class samples. Current oversampling methods primarily focus on balancing datasets without adequately validating the biological relevance of synthetic data, risking the clinical applicability of downstream model predictions. To address these shortcomings, we propose the Biological Evaluation Framework for Oversampling (BEFO) designed to ensure that synthetic gene expression samples accurately reflect the biological patterns present in original datasets. This innovation not only mitigates bias but enhances the trustworthiness of predictive models in clinical scenarios. We have developed a ranking method for synthetic samples based on this and evaluated each sample’s inclusion based on its rank. This ranking method calculates the WGCNA gene co-expression clusters on the original dataset. Several random forests are constructed to assess the alignment of each synthetic sample to each cluster. Only synthetic samples more important than real samples are included in a study. The experimental results demonstrate that our proposed ML oversampling framework can improve the biological feasibility of oversampled datasets by an average of 11%, leading to improved classification performance by an average of 9% when compared against five state-of-the-art (SOTA) oversampling methods and ten classification algorithms across six real world gene expressions datasets. Thereby establishing a new standard for synthetic data evaluation in biomedical ML applications.
机器学习(ML)技术正逐渐被用于生物医学研究,以提高诊断和预后的准确性,当与临床医生一起作为决策支持系统使用时。然而,生物医学研究中使用的许多数据集往往由于人口规模小而存在严重的类不平衡,这导致机器学习模型偏向于大多数类样本。目前的过采样方法主要侧重于平衡数据集,而没有充分验证合成数据的生物学相关性,这可能会影响下游模型预测的临床适用性。为了解决这些缺点,我们提出了过采样生物评估框架(BEFO),旨在确保合成基因表达样本准确反映原始数据集中存在的生物模式。这一创新不仅减轻了偏见,而且提高了预测模型在临床场景中的可信度。我们在此基础上开发了一种合成样品的排名方法,并根据其排名评估每个样品的包含情况。该排序方法在原始数据集上计算WGCNA基因共表达簇。构建了几个随机森林来评估每个合成样本与每个簇的对齐情况。只有比真实样本更重要的合成样本才会被纳入研究。实验结果表明,与五种最先进的(SOTA)过采样方法和十种分类算法相比,我们提出的ML过采样框架可以将过采样数据集的生物学可行性平均提高11%,从而在六个真实世界的基因表达数据集上平均提高9%的分类性能,从而为生物医学ML应用中的合成数据评估建立了新的标准。
{"title":"Towards a Biological Evaluation Framework for Oversampling (BEFO) gene expression data","authors":"Kevin Fee ,&nbsp;Suneil Jain ,&nbsp;Ross G. Murphy ,&nbsp;Anna Jurek-Loughrey","doi":"10.1016/j.jbi.2025.104932","DOIUrl":"10.1016/j.jbi.2025.104932","url":null,"abstract":"<div><div>Machine learning (ML) techniques are progressively being used in biomedical research to improve diagnostic and prognostic accuracy when used in conjunction with a clinician as a decision support system. However, many datasets used in biomedical research often suffer from severe class imbalance due to small population sizes, which causes machine learning models to become biased to majority class samples. Current oversampling methods primarily focus on balancing datasets without adequately validating the biological relevance of synthetic data, risking the clinical applicability of downstream model predictions. To address these shortcomings, we propose the Biological Evaluation Framework for Oversampling (BEFO) designed to ensure that synthetic gene expression samples accurately reflect the biological patterns present in original datasets. This innovation not only mitigates bias but enhances the trustworthiness of predictive models in clinical scenarios. We have developed a ranking method for synthetic samples based on this and evaluated each sample’s inclusion based on its rank. This ranking method calculates the WGCNA gene co-expression clusters on the original dataset. Several random forests are constructed to assess the alignment of each synthetic sample to each cluster. Only synthetic samples more important than real samples are included in a study. The experimental results demonstrate that our proposed ML oversampling framework can improve the biological feasibility of oversampled datasets by an average of 11%, leading to improved classification performance by an average of 9% when compared against five state-of-the-art (SOTA) oversampling methods and ten classification algorithms across six real world gene expressions datasets. Thereby establishing a new standard for synthetic data evaluation in biomedical ML applications.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104932"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145329281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A LangChain-based pipeline for one-shot synthetic text generation using generative pre-trained transformers in palliative care research 一种基于langchain的管道,用于姑息治疗研究中使用生成式预训练转换器的一次性合成文本生成。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-10-13 DOI: 10.1016/j.jbi.2025.104936
Isabel Ronan , Patrice Crowley , Eva Rombouts , Nicola Cornally , Mohamad M. Saab , David Murphy , Sabin Tabirca

Objective:

As the world’s population ages, nursing homes are of increasing importance. In order to care for a growing number of older adults, intelligent technologies are needed. Artificial Intelligence can be utilised to enhance palliative care in nursing homes. However, the data needed to train artificially intelligent agents is lacking within this sensitive domain due to privacy issues. Therefore, it is difficult for researchers to develop technological solutions. With the advent of large language models, such as ChatGPT, new text generation methods are made possible using limited data. In this pilot study, we investigate the use of large language models to generate synthetic data.

Methods:

We investigate the feasibility of using GPT-3.5 and GPT-4o models along with one-shot prompting to produce synthetic nurse notes which faithfully describe nursing home residents with met or unmet palliative care needs. We used LangChain to create a repeatable pipeline which can be adapted to different use-cases. We also compare the performance of both models using a set of qualitative and quantitative evaluations to determine which set of notes is more suitable for subsequent research.

Results:

GPT-3.5 performed slightly better than GPT-4o in our qualitative healthcare professional analysis. Quantitative analysis revealed appropriately heterogenous results across contextual similarity, lexical overlap, sentiment, and readability scores.

Conclusion:

Our work is the first investigation of such a generation method in the nursing home palliative care domain. Further refinement and validation of such data is needed in order to ensure the safe use of our approach.
目的:随着世界人口的老龄化,养老院变得越来越重要。为了照顾越来越多的老年人,需要智能技术。人工智能可以用来加强养老院的姑息治疗。然而,由于隐私问题,在这个敏感领域缺乏训练人工智能代理所需的数据。因此,研究人员很难制定技术解决方案。随着大型语言模型(如ChatGPT)的出现,使用有限数据的新文本生成方法成为可能。在这个试点研究中,我们研究了使用大型语言模型来生成合成数据。方法:探讨利用GPT-3.5和gpt - 40模型,结合一次性提示,制作真实描述满足或未满足姑息治疗需求的疗养院居民的合成护理笔记的可行性。我们使用LangChain创建了一个可重复的管道,它可以适应不同的用例。我们还使用一组定性和定量评价来比较两种模型的性能,以确定哪一组笔记更适合后续研究。结果:GPT-3.5在定性医疗专业分析中的表现略好于gpt - 40。定量分析揭示了上下文相似性、词汇重叠、情感和可读性得分的适当异质性结果。结论:我们的工作是在养老院姑息治疗领域的这种生成方法的第一次调查。为了确保我们的方法的安全使用,需要进一步改进和验证这些数据。
{"title":"A LangChain-based pipeline for one-shot synthetic text generation using generative pre-trained transformers in palliative care research","authors":"Isabel Ronan ,&nbsp;Patrice Crowley ,&nbsp;Eva Rombouts ,&nbsp;Nicola Cornally ,&nbsp;Mohamad M. Saab ,&nbsp;David Murphy ,&nbsp;Sabin Tabirca","doi":"10.1016/j.jbi.2025.104936","DOIUrl":"10.1016/j.jbi.2025.104936","url":null,"abstract":"<div><h3>Objective:</h3><div>As the world’s population ages, nursing homes are of increasing importance. In order to care for a growing number of older adults, intelligent technologies are needed. Artificial Intelligence can be utilised to enhance palliative care in nursing homes. However, the data needed to train artificially intelligent agents is lacking within this sensitive domain due to privacy issues. Therefore, it is difficult for researchers to develop technological solutions. With the advent of large language models, such as ChatGPT, new text generation methods are made possible using limited data. In this pilot study, we investigate the use of large language models to generate synthetic data.</div></div><div><h3>Methods:</h3><div>We investigate the feasibility of using GPT-3.5 and GPT-4o models along with one-shot prompting to produce synthetic nurse notes which faithfully describe nursing home residents with met or unmet palliative care needs. We used LangChain to create a repeatable pipeline which can be adapted to different use-cases. We also compare the performance of both models using a set of qualitative and quantitative evaluations to determine which set of notes is more suitable for subsequent research.</div></div><div><h3>Results:</h3><div>GPT-3.5 performed slightly better than GPT-4o in our qualitative healthcare professional analysis. Quantitative analysis revealed appropriately heterogenous results across contextual similarity, lexical overlap, sentiment, and readability scores.</div></div><div><h3>Conclusion:</h3><div>Our work is the first investigation of such a generation method in the nursing home palliative care domain. Further refinement and validation of such data is needed in order to ensure the safe use of our approach.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104936"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145300760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A comprehensive evaluation framework for synthetic medical tabular data generation 合成医学表格数据生成的综合评价框架。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 Epub Date: 2025-10-14 DOI: 10.1016/j.jbi.2025.104939
Anastasia Kurakova, Hajar Homayouni
Machine learning (ML) applications have enabled significant advancements in healthcare, such as predicting pandemics, personalizing treatments, and developing life-saving drugs. However, ML model training requires large datasets, which are difficult to obtain in healthcare due to privacy concerns. Synthetic data generation offers a promising solution by providing access to large-scale training data while protecting patient privacy. Our research focuses on tabular medical data, the predominant format for Electronic Health Records (EHRs), and introduces a comprehensive evaluation framework that assesses synthetic data in four critical dimensions: quality, privacy, usability, and computational complexity of the data generation process. The framework ensures that synthetic data maintains sufficient similarity to real data for ML applications while preserving patient confidentiality. To validate our approach, we applied six state-of-the-art (SOTA) generative models to generate synthetic medical datasets and evaluated them within our framework. In contrast to conventional approaches that focus primarily on statistical similarity, our framework provides a broader assessment that incorporates outlier detection, privacy risks, and domain-specific constraints. Our findings demonstrate that our framework can identify critical shortcomings in synthetic data generation models, such as the amplification of duplicate rows and the generation of out-of-range values, which are overlooked by traditional statistical evaluation methods. Our implementation of the framework is available at: https://github.com/akurakova/SDE_Framework
机器学习(ML)应用使医疗保健领域取得了重大进步,例如预测流行病、个性化治疗和开发救生药物。然而,机器学习模型训练需要大型数据集,而由于隐私问题,这些数据集在医疗保健领域很难获得。合成数据生成提供了一个很有前途的解决方案,它在保护患者隐私的同时提供了对大规模训练数据的访问。我们的研究聚焦于表格式医疗数据,电子健康记录(EHRs)的主要格式,并引入了一个综合评估框架,从四个关键维度评估合成数据:质量、隐私、可用性和数据生成过程的计算复杂性。该框架确保合成数据与ML应用程序的真实数据保持足够的相似性,同时保护患者的机密性。为了验证我们的方法,我们应用了六个最先进的(SOTA)生成模型来生成合成医疗数据集,并在我们的框架内对它们进行了评估。与主要关注统计相似性的传统方法相比,我们的框架提供了更广泛的评估,包括异常值检测、隐私风险和特定领域的约束。我们的研究结果表明,我们的框架可以识别合成数据生成模型中的关键缺陷,例如重复行的放大和超出范围值的生成,这些都被传统的统计评估方法所忽视。我们的框架实现可以在:https://github.com/akurakova/SDE_Framework上找到。
{"title":"A comprehensive evaluation framework for synthetic medical tabular data generation","authors":"Anastasia Kurakova,&nbsp;Hajar Homayouni","doi":"10.1016/j.jbi.2025.104939","DOIUrl":"10.1016/j.jbi.2025.104939","url":null,"abstract":"<div><div>Machine learning (ML) applications have enabled significant advancements in healthcare, such as predicting pandemics, personalizing treatments, and developing life-saving drugs. However, ML model training requires large datasets, which are difficult to obtain in healthcare due to privacy concerns. Synthetic data generation offers a promising solution by providing access to large-scale training data while protecting patient privacy. Our research focuses on tabular medical data, the predominant format for Electronic Health Records (EHRs), and introduces a comprehensive evaluation framework that assesses synthetic data in four critical dimensions: quality, privacy, usability, and computational complexity of the data generation process. The framework ensures that synthetic data maintains sufficient similarity to real data for ML applications while preserving patient confidentiality. To validate our approach, we applied six state-of-the-art (SOTA) generative models to generate synthetic medical datasets and evaluated them within our framework. In contrast to conventional approaches that focus primarily on statistical similarity, our framework provides a broader assessment that incorporates outlier detection, privacy risks, and domain-specific constraints. Our findings demonstrate that our framework can identify critical shortcomings in synthetic data generation models, such as the amplification of duplicate rows and the generation of out-of-range values, which are overlooked by traditional statistical evaluation methods. Our implementation of the framework is available at: <span><span>https://github.com/akurakova/SDE_Framework</span><svg><path></path></svg></span></div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104939"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145308155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
期刊
Journal of Biomedical Informatics
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1