Biodata Mining: Latest Articles, Page 4

Computational prediction of cellular elastic modulus from mechanosensitive gene expression at multiple biological levels.

IF 6.1 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology

Biodata Mining

Pub Date : 2025-11-18 DOI: 10.1186/s13040-025-00496-z

Yanhong Xiong, Xiaoyan Zhou, Qi Wang

Background: Understanding cellular mechanical properties is crucial for investigating cell fate determination, embryonic development, and disease progression. Traditional methods of measuring cellular mechanical properties, such as atomic force microscopy, are time-consuming, labor-intensive, and low-throughput. Computational models that capture the relationship between mechanosensitive gene expression, a readily accessible and cost-effective data source, and cellular mechanical properties offer a promising alternative.

Results: In this study, we identified mechanosensitive genes from 104 cell lines, using RNA-seq data and corresponding elastic modulus from the MechanoBase database. Several statistical learning models were tested and gradient boosting regression emerged as the most effective, outperforming other models in accuracy. We termed this model MechanoGEPred. The model demonstrated its ability to predict elastic modulus variations across tissue samples, single cells, and tissue spatial domains, capturing complex relationships between gene expression and mechanical properties.

Conclusions: By enabling predictions at multiple biological levels, MechanoGEPred offers a useful framework for inferring cellular elastic modulus directly from gene expression data. The model reveals biologically meaningful patterns and context-dependent differences, suggesting potential applications in biomechanics and cancer research, and providing a proof of concept for studying mechanical heterogeneity and its role in health and disease.
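The abstract names gradient boosting regression as the best-performing model but gives no implementation details. As a rough illustration of the technique only, the sketch below boosts one-split regression "stumps" on squared error. The two "gene expression" features, the modulus values, and every function name are hypothetical; this is not the MechanoGEPred code or its data.

```python
# Toy gradient boosting for regression built from one-split "stumps".
# Data and names are hypothetical; this is not the MechanoGEPred code.

def fit_stump(X, residuals):
    """Pick the (feature, threshold) split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for threshold in values[:-1]:  # skip the max so both sides are non-empty
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("every feature is constant; no split is possible")
    return best[1:]


def predict(model, row):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    value = base
    for j, threshold, left_mean, right_mean in stumps:
        value += learning_rate * (left_mean if row[j] <= threshold else right_mean)
    return value


def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean of y, then repeatedly fit a stump to the residuals."""
    model = (sum(y) / len(y), learning_rate, [])
    for _ in range(n_rounds):
        residuals = [target - predict(model, row) for row, target in zip(X, y)]
        model[2].append(fit_stump(X, residuals))
    return model


def mse(model, X, y):
    return sum((target - predict(model, row)) ** 2 for row, target in zip(X, y)) / len(y)


# Hypothetical expression of two genes per sample, and made-up modulus values.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [4.0, 3.0], [5.0, 7.0], [6.0, 2.0]]
y = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2]
baseline = (sum(y) / len(y), 0.1, [])  # a model that always predicts the mean
model = fit_gradient_boosting(X, y)
print(mse(baseline, X, y), mse(model, X, y))
```

Each round fits the residuals of the current ensemble, so training error cannot increase under squared loss. A real analysis would use a tested library implementation and held-out data rather than training error.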


Biodata Mining 18(1):80, published 2025-11-18. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12625330/pdf/

Citations: 0

From prompt engineering to agent engineering: expanding the AI toolbox with autonomous agentic AI collaborators for biomedical discovery.

IF 6.1 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology

Biodata Mining

Pub Date : 2025-11-13 DOI: 10.1186/s13040-025-00502-4

Jason H Moore, Nicholas P Tatonetti

Citations: 0

CAUSALRLSTACK: adaptive balancing of deep representation and causal effect estimation with application to HIV-related health data.

IF 6.1 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology

Biodata Mining

Pub Date : 2025-11-05 DOI: 10.1186/s13040-025-00492-3

Dat Thanh Pham, Khai Quang Tran, Viet Anh Nguyen

Background and objective: Estimating individualized causal effects plays a vital role in data-driven decision-making, especially in high-risk domains such as public health. However, current causal inference models often lack flexibility and generalizability due to the tight coupling between representation learning and effect estimation. This study aims to develop a modular and adaptive framework to enhance the analysis of individualized causal effects in complex health data.

Methods: We propose CAUSALRLSTACK, a modular framework designed to separate representation learning from causal effect estimation. The model uses a memory-augmented Transformer (TITAN) to capture complex, individualized representations, and pairs it with a doubly robust estimator (DRLearner) to improve treatment effect estimation. A reinforcement learning agent adjusts how much each component contributes by assigning instance-specific weights. This adaptive weighting improves the model's ability to generalize across different populations. Input features are derived from causal graphs, automatically chosen between an expert-defined graph and one discovered from data. To evaluate performance, we applied the framework to two publicly available HIV datasets that reflect community-level testing behavior and post-intervention clinical outcomes.

Results: CAUSALRLSTACK outperforms six state-of-the-art causal inference models across both datasets, achieving the highest accuracy (0.861 and 0.855), F1-Score (0.845 and 0.839), and AUC-ROC (0.897 and 0.892). It also achieves the lowest predictive uncertainty (0.093 and 0.092), indicating robust performance in estimating treatment effects.

Conclusions: The proposed framework offers a flexible and effective solution for individualized causal inference. Its modular architecture and reinforcement learning-based weighting strategy enable adaptive, data-driven estimation across diverse populations. Strong experimental results demonstrate the potential of the framework to advance individualized causal inference in health data and provide a practical basis for designing personalized intervention strategies in HIV and broader public health domains.
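The abstract mentions a doubly robust estimator and instance-specific weights without giving formulas. The sketch below shows the textbook doubly robust (AIPW) pseudo-outcome, which combines an outcome model with a propensity model. The convex `blend` function is only an assumed reading of "instance-specific weights"; the abstract does not state the actual combination rule. All names and numbers are hypothetical.

```python
# Textbook AIPW (doubly robust) pseudo-outcomes on toy numbers.
# Names are hypothetical; this is not the CAUSALRLSTACK implementation.

def aipw_pseudo_outcomes(treated, outcome, propensity, mu0, mu1):
    """Return one doubly robust score per individual.

    treated:    1 if the individual received the treatment, else 0
    outcome:    observed outcome
    propensity: estimated P(treated = 1 | covariates), strictly between 0 and 1
    mu0, mu1:   outcome-model predictions under control and under treatment
    """
    scores = []
    for t, y, e, m0, m1 in zip(treated, outcome, propensity, mu0, mu1):
        # Inverse-probability-weighted residual corrects the outcome model.
        correction = t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)
        scores.append(m1 - m0 + correction)
    return scores


def blend(representation_estimate, dr_estimate, weight):
    """ASSUMPTION: a convex combination with a per-instance weight in [0, 1]."""
    return weight * representation_estimate + (1 - weight) * dr_estimate


treated = [1, 0, 1, 0]
outcome = [1.0, 0.0, 1.0, 1.0]
propensity = [0.5, 0.5, 0.5, 0.5]
mu0 = [0.2, 0.2, 0.2, 0.2]
mu1 = [0.8, 0.8, 0.8, 0.8]

scores = aipw_pseudo_outcomes(treated, outcome, propensity, mu0, mu1)
average_effect = sum(scores) / len(scores)
print(scores, average_effect)
```

Averaging the scores estimates the average treatment effect. Regressing them on covariates gives individualized estimates, which is the general idea behind DR-learner style methods. The estimate stays consistent if either the propensity model or the outcome model is correct, hence "doubly robust".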


Biodata Mining 18(1):77, published 2025-11-05. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12587697/pdf/

Citations: 0

Deep learning-driven TCRβ repertoire analysis enhances diagnosis and enables mining of immunological biomarkers in systemic lupus erythematosus.

IF 6.1 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology

Biodata Mining

Pub Date : 2025-10-31 DOI: 10.1186/s13040-025-00490-5

Tongfei Shen, Yifei Sheng, Wan Nie, Shuo Yang, Kaiqi Li, Ziwei Ma, Zhao Ling, Bowen Tan, Xikang Feng, Miaozhe Huo

Background: Systemic Lupus Erythematosus (SLE) is a complex autoimmune disorder involving dysregulation of multiple immune components, including T cells. Aberrant T-cell activity contributes significantly to the immune pathology of SLE, for instance, by facilitating autoantibody production. The Complementarity Determining Region 3 (CDR3) of the TCRβ chain is pivotal for T-cell specificity, which makes it a promising target for enhancing diagnostic accuracy and gaining deeper mechanistic insights into SLE. To address diagnostic limitations in SLE, our team developed DeepTAPE, a deep learning-based diagnostic framework that uses CDR3 sequences to achieve robust classification performance for SLE.

Results: Building upon the foundation established by DeepTAPE, we devised a novel diagnostic approach that effectively integrates a TCR classifier to quantify SLE disease activity. Furthermore, this methodology employs advanced deep learning models for the bio-mining of disease-associated motifs that serve as potential biomarkers. As a result, this approach generates an autoimmune risk score (ARS) indicative of SLE probability. Notably, this ARS metric exhibited a strong correlation with disease activity, functioning as a quantitative clinical marker that complements traditional indices such as the SLE Disease Activity Index (SLEDAI). In addition, through a comprehensive analysis of immune repertoire data, we identified SLE-specific amino acid motifs within the CDR3 sequences, including critical 3-mer and gapped-mer oligopeptides. These motifs demonstrated high efficacy in SLE classification, achieving an area under the curve (AUC) of 0.908, thereby significantly outperforming other candidate biomarkers. Moreover, our model revealed potential SLE-associated antigens and genes, such as CD109 and INS, which provide new insights into the immunological mechanisms underlying the disease.

Conclusion: This study highlights the potential of DeepTAPE as a supportive tool for biomarker discovery and assessing SLE disease activity, which complements traditional diagnostic approaches. By deepening our understanding of the immunological characteristics and mechanisms associated with SLE, this work lays a foundation for advancing targeted therapies and personalized medicine in autoimmune diseases. Consequently, our findings may pave the way for improved patient outcomes and more effective treatment strategies in the management of SLE.
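The Results mention 3-mer and gapped-mer oligopeptide motifs within CDR3 sequences. As a minimal sketch of what such motif extraction can look like, the code below enumerates contiguous 3-mers and one possible kind of gapped motif. The gapped-motif definition used here (wildcards right after the first residue) is an assumption, since the abstract does not define it, and the sequences are toy strings rather than study data.

```python
from collections import Counter

# Toy motif extraction from CDR3 amino-acid strings.
# The gapped-motif definition is an ASSUMPTION made for illustration.


def kmers(sequence, k=3):
    """All contiguous substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def gapped_kmers(sequence, k=3, gap=1, wildcard="."):
    """Windows of k residues plus `gap` masked positions after the first residue."""
    width = k + gap
    motifs = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        motifs.append(window[0] + wildcard * gap + window[1 + gap:])
    return motifs


def motif_presence(repertoire, k=3):
    """Count, for each 3-mer, how many sequences in the repertoire contain it."""
    counts = Counter()
    for sequence in repertoire:
        counts.update(set(kmers(sequence, k)))
    return counts


toy_repertoire = ["CASSL", "CASSF"]  # made-up sequences
print(kmers("CASSL"))          # ['CAS', 'ASS', 'SSL']
print(gapped_kmers("CASSL"))   # ['C.SS', 'A.SL']
print(motif_presence(toy_repertoire)["CAS"])  # 2
```

Comparing such presence counts between patient and control repertoires is one simple way to shortlist candidate motifs. The study itself derives its motifs with deep learning models, which this sketch does not attempt to reproduce.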


Biodata Mining 18(1):76, published 2025-10-31. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12577242/pdf/

Citations: 0

FISM: harnessing deep learning and reinforcement learning for precision detection of microaneurysms and retinal exudates for early diabetic retinopathy diagnosis.

IF 6.1 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology

Biodata Mining

Pub Date : 2025-10-30 DOI: 10.1186/s13040-025-00485-2

Abbas Rehman, Gu Naijie, Stephen Ojo, Thomas I Nathaniel, Nagwan Abdel Samee, Muhammad Umer, Mona M Jamjoom

Diabetic retinopathy (DR) is a primary cause of blindness globally, and its treatment and management depend on accurate and timely identification. Current approaches to DR detection and segmentation repeatedly fall short in accuracy and robustness, highlighting the need for advanced computational methods. In this study, we propose a deep learning model, the Fundus Images Segmentation Model (FISM), designed to precisely detect microaneurysms and retinal exudates, both dangerous indicators of DR. Using the Diabetic Retinopathy Dataset (DDR), our model draws on both the segmentation and grading subsets, which comprise over 13,000 fundus images annotated with comprehensive lesion-level and DR severity information, enabling robust training for both detection and classification tasks. The preprocessing pipeline includes band separation, generative adversarial network (GAN)-based data augmentation, and extensive normalization. The FISM architecture is derived from the Segment Anything Model (SAM) and integrates transformer layers and patch embedding techniques. The model begins with patch embedding followed by transformer blocks to capture both local and global relationships within retinal images. The architecture employs transfer learning, domain-specific fine-tuning, customized loss functions, and attention mechanisms to optimize feature extraction and segmentation accuracy. The image encoder and mask decoder modules work in tandem to transform input retinal images into precise segmentation masks, highlighting regions affected by DR. Beyond deep learning, the framework also integrates reinforcement learning to direct the exploration of regions of interest, so that the model can highlight the areas most relevant to a diagnosis. This form of adaptive attention improves detection precision and reduces computational cost.
Results show that FISM outperforms state-of-the-art methods, achieving 96.32% accuracy, 95.14% precision, 95.25% recall and a 96.33% F1-score. The model demonstrates an AUC of 96.32%, specificity of 94.13%, segmentation Dice coefficient of 94.21% and IoU of 96.01%. These metrics indicate superior performance in both detection and segmentation tasks for early diabetic retinopathy diagnosis.
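The segmentation results are reported as a Dice coefficient and an IoU. For readers unfamiliar with the two overlap measures, the sketch below computes both for a pair of binary masks. The masks are toy examples and the function name is hypothetical. For any single pair of masks the two are linked by Dice = 2·IoU / (1 + IoU), so Dice is never smaller than IoU. The abstract does not say how its two figures were aggregated.

```python
# Dice coefficient and IoU for two binary masks given as nested lists.
# Toy masks; the function name is hypothetical.

def dice_and_iou(predicted, truth):
    """Return (dice, iou) for two equally shaped 0/1 masks."""
    pred_flat = [value for row in predicted for value in row]
    truth_flat = [value for row in truth for value in row]
    if len(pred_flat) != len(truth_flat):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(1 for p, t in zip(pred_flat, truth_flat) if p and t)
    pred_area = sum(1 for p in pred_flat if p)
    truth_area = sum(1 for t in truth_flat if t)
    union = pred_area + truth_area - intersection

    if union == 0:
        # Both masks are empty: treat as perfect agreement by convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


predicted = [[1, 1, 0],
             [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

dice, iou = dice_and_iou(predicted, truth)
print(dice, iou)  # 2 overlapping pixels, 3 + 3 foreground pixels, union of 4
```

Here the overlap is 2 pixels, so Dice is 4/6 and IoU is 2/4, which satisfies the identity above.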


Biodata Mining 18(1):75, published 2025-10-30. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12576988/pdf/

Citations: 0

Using artificial intelligence (AI) to model clinical variant reporting for next generation sequencing (NGS) oncology assays.

IF 6.1 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology

Biodata Mining

Pub Date : 2025-10-29 DOI: 10.1186/s13040-025-00489-y

Kenneth D Doig, Rashindrie Perera, Yamuna Kankanige, Andrew Fellowes, Jason Li, Richard Lupat, Ella R Thompson, Piers Blombery, Stephen B Fox

Background: Targeted next generation sequencing (NGS) of somatic DNA is now routinely used for diagnostic and predictive reporting in the oncology clinic. The expert genomic analysis required for NGS assays remains a bottleneck to scaling the volume of patients being assessed. This study harnesses data from targeted clinical sequencing to build machine learning models that predict whether patient variants should be reported.

Methods: Three somatic assays were used to build machine learning prediction models using the estimators Logistic Regression, Random Forest, XGBoost and Neural Networks. Using manual expert curation to select reportable variants as ground truth, we built models to classify clinically reportable variants. Assays were performed between 2020 and 2023 yielding 1,350,018 variants and used to report on 10,116 patients. All variants, together with 211 annotations and sequencing features, were used by the models to predict the likelihood of variants being reported.

Results: The tree-based ensemble models performed consistently well, achieving between 0.904 and 0.996 on the precision-recall area under the curve (PRC AUC) metric when predicting whether a variant should be reported. To assist model explainability, individual model predictions were presented to users within a tertiary analysis platform as a waterfall plot showing individual feature contributions and their values for the variant. Over 30% of the model performance was due to features sourced from statistics derived in-house from the sequencing assay, precluding easy generalization of the models to other assays or other laboratories.

Conclusions: Longitudinally acquired NGS assay data provide a strong basis for machine learning models that offer decision support in selecting variants for clinical oncology reports. The models provide a framework for consistent reporting practices and reduced inter-reviewer variability. To improve model transparency, individual variant predictions can be presented as part of reviewer workflows.
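The headline metric above is the area under the precision-recall curve. One common estimator of that area is average precision, sketched below in plain Python. Whether the study used this estimator or an interpolated variant is not stated in the abstract, and the labels and scores here are made up.

```python
# Average precision as one common estimate of the area under the PR curve.
# Toy labels and scores; ties in the scores are not handled specially.

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positives")

    # Rank candidates from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# 1 = variant was reported by the expert curator, 0 = not reported (made up).
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]

ap = average_precision(labels, scores)
print(round(ap, 4))  # (1/1 + 2/3 + 3/4) / 3
```

A model that ranks at random scores close to the fraction of positives, so precision-recall metrics tend to be more informative than ROC AUC when positives are rare.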


Biodata Mining 18(1):74, published 2025-10-29. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12570631/pdf/

Citations: 0

Automatic computational classification of bone marrow cells for B cell pediatric leukemia using UMAP.

IF 6.1 | Zone 3 (Biology) | Q1 Mathematical & Computational Biology

Biodata Mining

Pub Date : 2025-10-21 DOI: 10.1186/s13040-025-00488-z

Ana Niño-López, Álvaro Martínez-Rubio, Rocío Picón-González, Ana Castillo Robleda, Manuel Ramírez Orellana, Salvador Chulián, María Rosa

B-cell Acute Lymphoblastic Leukemia (B-ALL) accounts for approximately 80% of pediatric leukemia cases. Despite treatment advances, 15-20% of children experience relapse, highlighting the need for improved patient monitoring and novel strategies that lead to successful therapies. Flow cytometry is an essential technique for measuring residual disease and guiding treatment. However, traditional manual gating limits its efficiency. In recent years, computational tools have been integrated to enhance these clinical processes, but many mathematical techniques remain underexploited. In particular, Uniform Manifold Approximation and Projection (UMAP), together with machine learning, provides a promising approach for analyzing large datasets. Mathematical tools and artificial intelligence offer new perspectives on these health problems, beyond the usual approach in biomedicine. We used 234 samples from 75 B-ALL patients to develop an artificial intelligence-based algorithm that can improve patient classification and therapy decisions in different patient cohorts. This is an advance over routine manual analysis of disease progression: the algorithm identifies key subpopulations automatically and distinguishes patients' bone marrow regeneration patterns, thus improving prediction and prognosis of the disease.


Biodata Mining 18(1):73. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12538793/pdf/

Citations: 0

WHFDL: an explainable method based on World Hyper-heuristic and Fuzzy Deep Learning approaches for gastric cancer detection using metabolomics data.

IF 6.1 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2025-10-10 DOI: 10.1186/s13040-025-00486-1

Nora Mahdavi, Arman Daliri, Mahdieh Zabihimayvan, Yalda Yaghooti, Mohammad Mahdi Mir, Parastoo Ghazanfari, Avin Zarrabi, Pedram Khalaj, Reza Sadeghi

Background: Gastric cancer (GC) remains one of the most prevalent cancers worldwide, with its prognosis heavily reliant on early detection. Traditional GC diagnostic methods are invasive and risky, prompting interest in non-invasive alternatives that could improve outcomes.

Method: In this study, we introduce a non-invasive approach, World Hyper-heuristic Fuzzy Deep Learning (WHFDL), for gastric cancer prediction using metabolomics. Metabolomics profiles of plasma samples from 702 individuals were obtained and used for classification. For efficient feature selection, we employed the World Hyper Heuristic, a metaheuristic, to extract the most relevant features from the dataset. The selected features were then classified with a Fuzzy Deep Neural Network.
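The abstract does not describe how the World Hyper Heuristic searches the feature space, so the following is only a minimal sketch of the general idea of metaheuristic (wrapper) feature selection. It swaps in a plain stochastic hill climb over feature masks, scored by a leave-one-out nearest-centroid classifier. The toy data and every function name (`make_toy_data`, `loo_centroid_accuracy`, `select_features`) are hypothetical and are not taken from the WHFDL code base.

```python
import random


def make_toy_data(n_per_class=10, n_noise=4, seed=1):
    """Hypothetical data: feature 0 separates the classes, the rest is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            informative = 10.0 * label + 0.1 * rng.random()
            noise = [rng.random() for _ in range(n_noise)]
            X.append([informative] + noise)
            y.append(label)
    return X, y


def loo_centroid_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature set cannot classify anything
    correct = 0
    for i, row in enumerate(X):
        sums, counts = {}, {}
        for k, other in enumerate(X):
            if k == i:
                continue  # hold out sample i
            acc = sums.setdefault(y[k], [0.0] * len(cols))
            for c, j in enumerate(cols):
                acc[c] += other[j]
            counts[y[k]] = counts.get(y[k], 0) + 1

        def sq_dist(label):
            # Squared distance from the held-out row to a class centroid.
            return sum((row[j] - sums[label][c] / counts[label]) ** 2
                       for c, j in enumerate(cols))

        correct += min(sums, key=sq_dist) == y[i]
    return correct / len(X)


def select_features(X, y, n_iter=200, seed=0):
    """Stochastic hill climb over boolean feature masks (a stand-in metaheuristic)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    mask = [rng.random() < 0.5 for _ in range(n_features)]
    best = loo_centroid_accuracy(X, y, mask)
    for _ in range(n_iter):
        candidate = mask[:]
        j = rng.randrange(n_features)
        candidate[j] = not candidate[j]  # flip one feature in or out
        score = loo_centroid_accuracy(X, y, candidate)
        # Accept a better score, or an equal score with fewer features.
        if score > best or (score == best and sum(candidate) < sum(mask)):
            mask, best = candidate, score
    return mask, best


if __name__ == "__main__":
    X, y = make_toy_data()
    mask, score = select_features(X, y)
    print("selected feature indices:", [j for j, keep in enumerate(mask) if keep])
    print("leave-one-out accuracy:", score)
```

The tie-breaking rule (prefer fewer features at equal accuracy) is one simple way to push the search toward a compact biomarker panel; a real hyper-heuristic would typically choose among several such move operators rather than a single bit flip.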

Results: The performance of WHFDL was assessed and compared against a comprehensive set of classical and state-of-the-art feature selection and classification algorithms. Our results highlighted six key metabolites as biomarkers associated with gastric cancer: 1-Methyladenosine, C18-Carnitine, Guanidineacetic acid, Hypoxanthine, Nicotinamide mononucleotide, and Succinate. WHFDL outperformed all other classifiers, achieving an F1-score, recall, and precision of 94%, 93%, and 94%, respectively, along with an accuracy of 94% and an area under the curve (AUC) of 0.9384. Interpretability was analyzed using SHAP, LIME, IG calibration analysis, and adversarial testing, demonstrating the model's transparency. The source code is available at https://github.com/arman-daliri/WHFDL.
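For readers less familiar with the reported metrics, the sketch below shows how precision, recall, F1-score, and accuracy follow from the four cells of a binary confusion matrix. The counts in the usage example are hypothetical and chosen only for illustration; they are not the confusion matrix of the WHFDL study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Undefined ratios (zero denominators) become 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical counts for 200 samples (100 cases, 100 controls).
    metrics = classification_metrics(tp=93, fp=6, fn=7, tn=94)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which is a quick sanity check on any reported triple.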


Biodata Mining 18(1):72. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12514820/pdf/

Citations: 0

MediNet: ensemble transfer learning approach for classification of medical drugs-related text reviews using significant combined-embeddings.

IF 6.1 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2025-10-09 DOI: 10.1186/s13040-025-00448-7

Tai-Hoon Kim, Asma Aldrees, Dina Abdulaziz AlHammadi, Muhammad Umer, Taoufik Saidani, Shtwai Alsubai, Imran Ashraf

This work presents MediNet, an approach for drug safety review classification that integrates three word embedding approaches (FastText, ELMo, and GloVe) with an ensemble of EfficientNetB4 and MobileNet models. The blend of these word embeddings captures both context-independent and context-dependent representations, enabling the model to capture complex linguistic nuances within drug reviews. The ensemble architecture leverages EfficientNetB4's scalability and MobileNet's efficiency, making MediNet both powerful and resource-efficient. MediNet is evaluated on a comprehensive dataset of drug safety reviews, achieving 95.69% accuracy, 96.46% precision, 98.30% recall, and a 97.22% F1 score. Its generalizability is evaluated using cross-validation, demonstrating the statistical significance of the results. MediNet is also compared against six other well-known transfer learning models and outperforms them across all metrics. These results suggest that MediNet is a highly effective solution for classifying drug safety reviews, significantly improving accuracy and reliability compared to existing models. The proposed approach offers a promising direction for future research in natural language processing and its application to healthcare.
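The abstract does not state how the three embeddings are combined or how the two networks' outputs are aggregated. A minimal sketch under two assumptions, namely that the embeddings are concatenated per token and that the ensemble averages class probabilities (soft voting), could look like this. The function names and the toy vectors are hypothetical.

```python
def combine_embeddings(*vectors):
    """Concatenate per-token embedding vectors from several sources.

    Assumption: the combined embedding is a plain concatenation, so its
    dimension is the sum of the input dimensions.
    """
    combined = []
    for vec in vectors:
        combined.extend(vec)
    return combined


def soft_vote(prob_lists):
    """Average class probabilities across models and return (label, averaged)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    averaged = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return averaged.index(max(averaged)), averaged


if __name__ == "__main__":
    # Toy 2-, 3-, and 2-dimensional vectors standing in for FastText, ELMo, GloVe.
    token_vec = combine_embeddings([0.1, 0.2], [0.3, 0.4, 0.5], [0.6, 0.7])
    print("combined dimension:", len(token_vec))

    # Toy class probabilities from two models for one review.
    label, averaged = soft_vote([[0.6, 0.4], [0.3, 0.7]])
    print("predicted class:", label, "averaged probabilities:", averaged)
```

Concatenation keeps each embedding's information intact at the cost of a wider input, while soft voting lets a confident model outweigh an uncertain one; other choices (weighted sums, stacking) are equally plausible readings of the abstract.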


Biodata Mining 18(1):71. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12512702/pdf/

Citations: 0

An intelligent healthcare system for rare disease diagnosis utilizing electronic health records based on a knowledge-guided multimodal transformer framework.

IF 6.1 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining

Pub Date : 2025-10-07 DOI: 10.1186/s13040-025-00487-0

Ahed Abugabah, Prashant Kumar Shukla, Piyush Kumar Shukla, Ankur Pandey

Rare diseases collectively affect millions of patients worldwide, but they are difficult to diagnose because of varied clinical presentations, small sample sizes, and disparate biomedical data sources. Current diagnostic tools cannot combine multimodal information effectively, which results in delayed or incorrect diagnoses. To fill this gap, this paper proposes a multimodal healthcare framework that integrates electronic health records (EHRs), genomic sequences, and medical imaging to improve the detection of rare diseases. The framework uses a Swin Transformer to extract hierarchical visual features from radiographic scans, Med-BERT and Transformer-XL to learn semantic and long-term temporal relations in longitudinal EHR narratives, and a Graph Neural Network (GNN)-based encoder to learn functional and structural relations in genomic sequences. Alignment of the cross-modal representations is further improved by a Knowledge-Guided Contrastive Learning (KGCL) mechanism, which uses rare disease ontologies from Orphanet to improve model interpretability and knowledge infusion. To achieve strong performance, the Nutcracker Optimization Algorithm (NOA) is employed to optimize hyperparameters, calibrate attention mechanisms, and enhance multimodal fusion. Experimental results on the MIMIC-IV (EHR), ClinVar (genomics), and CheXpert (imaging) datasets show that the proposed framework significantly outperforms state-of-the-art multimodal baselines in accuracy and robustness of early rare disease diagnosis. This work shows how hierarchical vision transformers, domain-specific language models, graph-based genomic encoders, and knowledge-directed optimization can be integrated to support explainable, accurate, and clinically applicable healthcare decisions in rare disease settings.
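The abstract does not give the form of the KGCL objective. As a rough illustration of what contrastive alignment between two modalities means, the sketch below implements a generic InfoNCE-style loss over paired embeddings (for example, an imaging embedding and an EHR embedding of the same patient). The use of InfoNCE, the temperature value, and the function names are assumptions, and the ontology-based knowledge guidance described in the paper is not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] and no other row."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    # Toy embeddings for two patients in two modalities.
    imaging = [[1.0, 0.0], [0.0, 1.0]]
    ehr_aligned = [[1.0, 0.0], [0.0, 1.0]]
    ehr_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", info_nce(imaging, ehr_aligned))
    print("swapped loss:", info_nce(imaging, ehr_swapped))
```

Minimizing this loss pulls matching cross-modal pairs together and pushes mismatched pairs apart; a knowledge-guided variant would presumably use the ontology to decide which pairs count as positives or how strongly negatives are penalized.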


Biodata Mining 18(1):70. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12505588/pdf/

Citations: 0