Frontiers in bioinformatics: latest articles (page 2)

CIT kinase phosphorylation as significant regulatory node for cellular checkpoints.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-12 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1734030

Jaytha Thomas, Fathimathul Lubaba, Mukhtar Ahmed, Althaf Mahin, Levin John, Athira Perunelly Gopalakrishnan, Suhail Subair, Prathik Basthikoppa Shivamurthy, Rajesh Raju, Sowmya Soman

Introduction: Citron Rho-interacting serine/threonine kinase (CIT) is a major cytosolic protein kinase essential for midbody organisation, abscission, and cytokinesis. Dysregulation and mutations in CIT are associated with multiple cancers and neurodevelopmental disorders, including microcephaly. Although global phosphoproteomic studies have identified more than 50 phosphosites in CIT, their functional relevance and the kinases regulating them remain largely unexplored.

Methods: To systematically investigate the phosphoregulation of CIT, we curated and integrated global phosphoproteomic datasets, along with their associated experimental conditions, to comprehensively catalogue phosphorylation events reported for CIT. To assess the functional significance of CIT, we examined proteins that were differentially co-regulated with its predominant phosphosite.

Results: Serine 440 (S440), located outside the kinase domain (representing over 55% of CIT-associated phospho-signalling events across 100 experimental conditions, including Enterovirus A71 infection, metformin, and interleukin-33), was identified as its predominant phosphosite. Motif analysis revealed the presence of a D(S/T)P/P(S/T)D motif recognised by the CIT kinase domain, suggesting S440 as a predicted autophosphorylation site. Co-phosphoregulation analysis identified 136 interacting proteins and 82 predicted substrates that were positively co-regulated with CIT_S440. The resulting phospho-regulatory network comprised essential cell cycle and DNA repair regulators, including MDC1 and TRIP12. Significantly, over 120 co-regulated phosphosites were functionally linked to DNA repair and cell cycle regulation. Aberrant phosphorylation of CIT_S440 observed across cancers of the breast, colon, and bladder suggests CIT_S440 as a potential onco-phosphosite critically involved in cellular checkpoint signalling.

Discussion: These findings suggest that CIT_S440 functions as a promising therapeutic target, and the phosphosite-centric regulatory network derived in this study could serve as a platform to evaluate its phosphosite-specific therapeutic interventions.
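The motif analysis mentioned in the Results can be illustrated with a small pattern scan. This is only a sketch, not the authors' pipeline: it assumes that "D(S/T)P/P(S/T)D" means either D-[S/T]-P or P-[S/T]-D, and the peptide below is a made-up string, not the CIT sequence.

```python
import re

# Assumed reading of "D(S/T)P/P(S/T)D": D-[ST]-P or P-[ST]-D.
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=(D[ST]P|P[ST]D))")

def find_motif_sites(sequence):
    """Return (1-based position of the central S/T, matched triplet) pairs."""
    return [(m.start() + 2, m.group(1)) for m in MOTIF.finditer(sequence)]

# Hypothetical peptide for illustration only (not the CIT sequence).
toy_peptide = "AKDSPLLGPTDRRDAPW"
sites = find_motif_sites(toy_peptide)
print(sites)  # → [(4, 'DSP'), (10, 'PTD')]
```

A real analysis would scan the residues flanking S440 in the reference protein sequence; the function name and the toy input here are hypothetical.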


Volume 5, article 1734030. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12833521/pdf/

Citations: 0

Genetic risk predictions using deep learning models with summary data.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-08 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1657021

Angela Wang, Elena Xiao, Jason Cheng, Xiaoxi Shen

Background: As a driving force of the Fourth Industrial Revolution, deep learning methods have achieved significant success across various fields, including genetic and genomic studies. While individual-level genetic data is ideal for deep learning models, privacy concerns and data-sharing restrictions often limit its availability to researchers.

Methods: In this paper, we investigated the potential applications of deep learning models (including deep neural networks, convolutional neural networks, recurrent neural networks, and transformers) when only genetic summary data, such as linkage disequilibrium matrices, is available. The bootstrap method was used to approximate the test error. Simulation studies and real data analyses were conducted to compare the performance of deep learning methods in genetic risk prediction using individual-level genetic data versus genetic summary data.

Results: The test mean squared errors (MSEs) of most applied deep learning models are comparable when using individual-level data versus summary data.

Conclusion: Our results suggest that suitable deep learning methods could also serve as an alternative approach to predict disease-related traits when only linkage disequilibrium matrices are available as input.
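The idea of approximating a test error with the bootstrap can be sketched as follows. The abstract does not say which bootstrap variant was used, so the out-of-bag scheme here is an assumption, and a one-variable least-squares line stands in for the deep learning models. All names and the simulated data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return my - b * mx, b

def bootstrap_test_mse(xs, ys, n_boot=200, seed=0):
    """Out-of-bag bootstrap estimate of the test MSE (assumed variant)."""
    rng = random.Random(seed)
    n = len(xs)
    errors = []
    for _ in range(n_boot):
        # Resample with replacement; points never drawn act as a test set.
        idx = [rng.randrange(n) for _ in range(n)]
        held_out = sorted(set(range(n)) - set(idx))
        if not held_out:
            continue
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return sum(errors) / len(errors)

# Simulated data: y = 2x + noise with unit variance.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(100)]
ys = [2 * x + data_rng.gauss(0, 1) for x in xs]
estimate = bootstrap_test_mse(xs, ys)
print(round(estimate, 3))
```

Because the simulated noise has unit variance, the estimate should land near 1; swapping `fit_line` for any other model leaves the resampling logic unchanged.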


Volume 5, article 1657021. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12823927/pdf/

Citations: 0

Altered circRNAs: a novel potential mechanism for the functions of extracellular vesicles derived from platelet-rich plasma.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-08 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1690932

Lifeng Niu, Yanli Wang, Yao Gao, Jun Zhang

Platelet-rich plasma (PRP) has been widely applied in clinical practice for tissue repair and regeneration. Recent studies have reported that, in addition to the secreted growth factors, large amounts of extracellular vesicles (EVs) derived from PRP (PRP-EVs) are also involved in tissue repair and regeneration. However, the relevant mechanisms of PRP-EVs remain unknown. In this study, we attempted to reveal the potential circular RNA (circRNA) mechanisms of PRP-EVs using high-throughput RNA sequencing (RNA-seq) and bioinformatics analysis. Six healthy donors were enrolled in this study, including three donors for the isolation of PRP-EVs and three donors for the isolation of EVs derived from blood plasma (plasma-EVs). We confirmed that PRP activation by thrombin could significantly promote the formation and secretion of EVs, particularly those with diameters ranging from 50 to 200 nm. Moreover, 144 circRNAs were altered in PRP-EVs with a fold change ≥ 2.0 and p-value ≤ 0.05. Among these, 89 circRNAs were upregulated, whereas 55 circRNAs were downregulated. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, and circRNA-miRNA-mRNA interaction network analyses were performed to predict the potential roles of circRNAs in PRP-EVs. GO analysis indicated that these altered circRNAs might be related to the physiological processes of cell genesis and development. The pathways that were most strongly correlated with the biological functions of PRP-EVs were the transforming growth factor β (TGF-β) signaling pathway and the HIF-1 signaling pathway. In addition, the expression levels of five selected circRNAs were verified through RT-qPCR. In conclusion, this is the first study to explain a novel potential mechanism of the biological functions of PRP-EVs in terms of the altered circRNAs. Taken together, our findings in this study may lay the groundwork for the clinical application of PRP-EVs and provide possible novel targets for further research.
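The selection rule quoted above (fold change ≥ 2.0 and p-value ≤ 0.05) can be written as a short filter. This sketch assumes the fold-change cutoff is applied symmetrically, so a ratio of 0.5 or less counts as downregulated; the abstract does not spell this out. The circRNA identifiers and values are invented for illustration.

```python
import math

FC_CUTOFF = 2.0   # fold-change threshold from the abstract
P_CUTOFF = 0.05   # p-value threshold from the abstract

def classify(fold_change, p_value):
    """Label a circRNA as 'up', 'down', or 'unchanged'.

    fold_change is assumed to be the PRP-EV / plasma-EV expression
    ratio and must be greater than zero.
    """
    if p_value > P_CUTOFF:
        return "unchanged"
    log2_fc = math.log2(fold_change)
    if log2_fc >= math.log2(FC_CUTOFF):
        return "up"
    if log2_fc <= -math.log2(FC_CUTOFF):
        return "down"
    return "unchanged"

# Hypothetical records: (circRNA id, fold change, p-value).
records = [
    ("circ_A", 3.1, 0.010),
    ("circ_B", 0.4, 0.030),
    ("circ_C", 2.5, 0.200),
    ("circ_D", 1.3, 0.001),
]
labels = {name: classify(fc, p) for name, fc, p in records}
print(labels)
# → {'circ_A': 'up', 'circ_B': 'down', 'circ_C': 'unchanged', 'circ_D': 'unchanged'}
```

Under this rule the reported counts are internally consistent: 89 upregulated plus 55 downregulated gives the 144 altered circRNAs.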


Volume 5, article 1690932. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12823818/pdf/

Citations: 0

KG2ML: integrating knowledge graphs and positive unlabeled learning for identifying disease-associated genes.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-08 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1727953

Praveen Kumar, Vincent T Metzger, Swastika T Purushotham, Priyansh Kedia, Cristian G Bologa, Christophe G Lambert, Jeremy J Yang

Background: Biomedical knowledge graphs (KGs), such as the Data Distillery Knowledge Graph (DDKG), capture known relationships among entities (e.g., genes, diseases, proteins), providing valuable insights for research. However, these relationships are typically derived from prior studies, leaving potential unknown associations unexplored. Identifying such unknown associations, including previously unknown disease-associated genes, remains a critical challenge in bioinformatics and is crucial for advancing biomedical knowledge.

Methods: Traditional methods, such as linkage analysis and genome-wide association studies (GWAS), can be time-consuming and resource-intensive. This highlights the need for efficient computational approaches to identify or predict new genes using known disease-gene associations. Recently, network-based methods and KGs, enhanced by advances in machine learning (ML) frameworks, have emerged as promising tools for inferring these unexplored associations. Given the technical limitations of the Neo4j Graph Data Science (GDS) machine learning pipeline, we developed a novel machine learning pipeline called KG2ML (Knowledge Graph to Machine Learning). This pipeline utilizes our Positive and Unlabeled (PU) learning algorithm, PULSCAR (Positive Unlabeled Learning Selected Completely At Random), and incorporates path-based feature extraction from ProteinGraphML.

Results: KG2ML was applied to 12 diseases, including Bipolar Disorder, Coronary Artery Disease, and Parkinson's Disease, to infer disease-associated genes not explicitly recorded in DDKG. For several of these diseases, 14 out of the 15 top-ranked genes lacked prior explicit associations in the DDKG but were supported by literature and TINX (Target Importance and Novelty Explorer) evidence. Incorporating PULSCAR-imputed genes as positives enhanced XGBoost classification, demonstrating the potential of PU learning in identifying hidden gene-disease relationships.

Conclusion: The observed improvement in classification performance after the inclusion of PULSCAR-imputed genes as positive examples, along with the subject matter experts' (SME) evaluations of the top 15 imputed genes for 12 diseases, suggests that PU learning can effectively uncover disease-gene associations missing from existing knowledge graphs (KGs). By integrating KG data with ML-based inference, our KG2ML pipeline provides a scalable and interpretable framework to advance biomedical research while addressing the inherent limitations of current KGs.
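The "selected completely at random" (SCAR) assumption that gives PULSCAR its name has a standard formal statement in the positive-unlabeled learning literature, sketched below. The symbols are not taken from the abstract: x is a feature vector, y ∈ {0, 1} is the true class, s ∈ {0, 1} indicates whether the example carries a positive label, and c is the label frequency. How PULSCAR itself estimates the proportion of positives among unlabeled examples is not described in the abstract and is not reproduced here.

```latex
\begin{align}
\Pr(s = 1 \mid x, y = 1) &= c, \qquad \Pr(s = 1 \mid x, y = 0) = 0 \\
\Rightarrow\quad \Pr(s = 1 \mid x) &= c \,\Pr(y = 1 \mid x)
\end{align}
```

In words: labeled positives are a uniform random sample of all positives, so a classifier trained to separate labeled from unlabeled examples recovers the true class probability up to the constant factor c.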


Volume 5, article 1727953. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12823822/pdf/

Citations: 0

Assessment of phylogenetic informativeness in mitochondrial and nuclear genes for mammalian systematics using sparse learning.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-08 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1704212

Carlos G Schrago, Beatriz Mello

Despite the growing availability of nuclear genomic data, mitochondrial genes remain the most widely used molecular markers in mammalian systematics. However, a quantitative assessment of the phylogenetic information content of mitochondrial loci compared to nuclear loci has never been carried out. Here, we apply a sparse learning approach based on Lasso regression to evaluate the contribution of alignment sites to phylogenetic likelihoods, providing the first estimates of phylogenetically effective lengths for markers commonly used in mammalian systematics. Analyzing more than 30,000 complete mammalian mitochondrial genomes and nuclear panels composed of either 100 randomly selected complete coding sequences or partial gene segments from conventional markers, we examined phylogenetic informativeness at two taxonomic levels: within-species and among-species. On average, ∼32% of mitochondrial sites and ∼38% of nuclear sites were classified as phylogenetically informative. We found that the number of phylogenetically informative sites was positively correlated with total gene length. Therefore, longer mitochondrial genes, particularly ND5, COX1, and CYTB, harbored the largest numbers of informative sites. Although nuclear coding sequences contained, on average, more informative sites, mitochondrial genes also yielded consistent resolution of among-species relationships. Overall, our results provide the first large-scale, quantitative comparison of phylogenetic information content across mammalian mitochondrial and nuclear genes, offering a principled framework for marker selection in future systematics studies that can be broadly applied to any lineage.
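The sparsity property that makes Lasso regression useful for separating informative from uninformative predictors can be shown on toy data. The abstract does not describe how the authors set up the regression (what the response and predictors are), so this is a generic sketch: a plain coordinate-descent Lasso on simulated columns, two of which drive the response. All names and data are hypothetical.

```python
import random

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; exactly zero inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n) * ||y - Xw||^2 + lam * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # covariance of column j with the residual ignoring w[j]
            norm = 0.0  # squared norm of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return w

# Simulated data: only columns 0 and 2 influence the response.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3 * row[0] - 2 * row[2] + rng.gauss(0, 0.1) for row in X]

w = lasso(X, y, lam=0.5)
informative = [j for j, wj in enumerate(w) if wj != 0.0]
print(informative)
```

The L1 penalty sets the coefficients of the two noise columns exactly to zero, so counting nonzero coefficients gives a natural tally of predictors that carry signal.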


Volume 5, article 1704212. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12824000/pdf/

Citations: 0

Identification of multiple prognostic biomarker sets for risk stratification in SKCM.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-07 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1624329

Shivani Malik, Ritu Tomer, Akanksha Arora, Gajendra P S Raghava

Introduction: The majority of available transcriptomics-related cancer prognosis studies strive to define one collection of biomarkers that can be used to predict high-risk patients. However, using a single biomarker profile could restrict its strength and applicability to diverse groups of patients. In order to fill this gap, we discuss the prospect of determining several, discrete sets of prognostic biomarkers in Skin Cutaneous Melanoma (SKCM). Our search identifies various genes including CREG1, PCGF5 and VPS13C whose expression pattern depicts significant correlations with overall survival (OS) in SKCM patients.Methods: We developed machine learning-based prognostic models using SKCM gene expression data to predict 1-, 3-, and 5-year overall survival. Advanced feature selection approaches were applied to identify prognostic biomarkers. The primary biomarker set consisted of 20 genes selected using state-of-the-art feature selection techniques. Machine learning classifiers were trained to distinguish high-risk from low-risk patients using these biomarkers. The process was systematically repeated to identify seven independent biomarker sets, each containing 20 unique genes without overlap. Model performance was evaluated using AUC and Cohen's Kappa metrics on an independent test dataset. Validation was further performed using the GEO dataset GSE65904, employing subsets of biomarkers from the primary and third sets.Results: The primary biomarker-based prognostic model demonstrated strong predictive ability, achieving an AUC of 0.90 and a Kappa of 0.58 in identifying high-risk SKCM patients. A second independent 20-gene set, with no overlap with the first, produced an AUC of 0.89 and Kappa of 0.56. Across all seven biomarker sets, performance ranged from 0.84 to 0.91 (AUC) and 0.48 to 0.64 (Kappa). Notably, the fifth biomarker set yielded the highest performance with an AUC of 0.91 and Kappa of 0.64. 
External validation confirmed the predictive utility of selected biomarkers where genes from the primary set achieved an AUC of 0.83 on GSE65904. While genes from the third set achieved an AUC of 0.86 on the same dataset.Discussion: Our results show that only one gene-expression signature is not sufficient to predict SKCM prognosis. Alternatively, high-risk patients can be accurately predicted using multiple independent biomarker sets providing flexibility in both clinical and computational practices. The high similarity in the results of all seven sets (AUC 0.84-0.91; Kappa 0.48-0.64) signifies the stability and strength of the method. The external validation of these biomarkers with GEO data also helps to confirm the reliability of these biomarkers and hints at their potential wider applicability. This work facilitates transparency by ensuring that all the data and code is publicly accessible (https://github.com/raghavagps/skcm_prognostic_biomarker), which a



Citations: 0

Advances in protein-protein interaction prediction: a deep learning perspective.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-07 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1710937

Noor Alkhateeb, Mamoun Awad

Protein-protein interactions (PPIs) are vital for regulating cellular functions and for understanding how diseases develop. Traditional experimental methods for identifying PPIs are costly and time-consuming. In recent years, advances in deep learning (DL) have transformed computational PPI prediction by enabling automatic feature extraction from protein sequences and structures. This survey presents a comprehensive analysis of DL-based models developed for PPI prediction, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), deep neural networks (DNNs), graph convolutional networks (GCNs), and ensemble architectures. The review compares their feature representations, learning strategies, and evaluation benchmarks, emphasizing their strengths and limitations in capturing complex dependencies and structural relationships. In addition, the paper describes the benchmarks and biological databases commonly used for performance comparison. Finally, we outline open challenges and future research directions to enhance model generalization, interpretability, and integration with biological knowledge for reliable PPI prediction.



Citations: 0

Stain-free artificial intelligence-assisted light microscopy for the identification of leukocyte morphology change in presence of bacteria.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-06 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1725145

Alexander Hunt, Holger Schulze, Kay Samuel, Robert B Fisher, Till T Bachmann

Background: Rapid detection of bacterial infections through leukocyte activation analysis could significantly reduce diagnostic timeframes from days to hours. Traditional methods like flow cytometry and biomarker assays face limitations including technical complexity, equipment requirements, and delayed results.

Methods: We developed a dual artificial neural network system combining stain-free light microscopy with microfluidic technology to detect morphological changes in activated leukocytes. YOLOv4 networks were trained using five-fold cross-validation on images of chemically stimulated leukocyte subpopulations (lymphocytes, monocytes, and neutrophils) and validated against flow cytometry. The system was tested on whole blood samples spiked with E. coli at clinically relevant concentrations (10-250 CFU/mL).

Results: The optimized four-class network achieved high performance metrics for lymphocytes (F1 score: 80.1% ± 2.5%) and neutrophils (F1 score: 91.7% ± 1.7%), while a specialized binary classifier for monocytes achieved 97.0% ± 2.8% F1 score. In bacteria-spiked whole blood experiments, the system successfully detected activated leukocytes within 30 min at concentrations approaching clinical blood culture detection limits (11.11 ± 4.79 CFU/mL). Neutrophils showed rapid activation peaking at 1-3 h, while lymphocyte activation increased gradually over 6-12 h, consistent with innate versus adaptive immune response kinetics.

Conclusion: This AI-assisted microscopy platform enables rapid, label-free detection of leukocyte activation in response to bacterial infection with minimal sample handling and no requirement for specialized staining or trained technicians. The technology demonstrates potential for accelerating infection diagnosis and could be extended to other pathologies with morphological manifestations.
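The F1 figures in the Results are reported as a value ± a spread over five-fold cross-validation. The sketch below shows one way such a summary can be computed, assuming the ± values are sample standard deviations across folds, which the abstract does not state. The per-fold counts are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize_folds(fold_counts):
    """Return (mean, sample standard deviation) of per-fold F1 scores, in percent."""
    scores = [100 * f1_score(tp, fp, fn) for tp, fp, fn in fold_counts]
    return mean(scores), stdev(scores)


# Hypothetical (TP, FP, FN) detection counts for five folds.
folds = [(90, 8, 10), (92, 9, 7), (88, 10, 12), (93, 6, 8), (91, 9, 9)]
avg, spread = summarize_folds(folds)
print(f"F1 score: {avg:.1f}% ± {spread:.1f}%")
```

Reporting the spread alongside the mean indicates how sensitive a detector is to the particular train/validation split.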



Citations: 0

A Transformers-based framework for refinement of genetic variants.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-05 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1694924

Omar Abdelwahab, Davoud Torkamaneh

Accurate variant calling refinement is crucial for distinguishing true genetic variants from technical artifacts in high-throughput sequencing data. While heuristic filtering and manual review are common approaches for refining variants, manual review is time-consuming, and heuristic filtering often lacks optimal solutions, especially for low-coverage data. Traditional variant calling methods often struggle with accuracy, especially in regions of low read coverage, leading to false-positive or false-negative calls. Advances in artificial intelligence, particularly deep learning, offer promising solutions for automating this refinement process. Here, we present a Transformers-based framework for genetic variant refinement that leverages self-attention to model dependencies among variant features and directly processes VCF files, enabling seamless integration with standard pipelines such as BCFTools and GATK4. Trained on 2 million variants from the GIAB (v4.2.1) sample HG003, the framework achieved 89.26% accuracy and a ROC AUC of 0.88. Across the tested samples, VariantTransformer improved baseline filtering accuracy by 4%-10%, demonstrating consistent gains over the default caller filters. When integrated into conventional variant calling pipelines, VariantTransformer outperformed traditional heuristic filters and, through refinement of existing caller outputs, approached the accuracy achieved by state-of-the-art AI-based variant callers such as DeepVariant, despite not operating as a standalone caller. By positioning this work as a flexible and generalizable framework rather than a single-use model, we highlight the underexplored potential of Transformers for variant refinement in genomics. This study contributes a blueprint for adapting Transformer architectures to a wide range of genomic quality control and filtering tasks. Code is available at: https://github.com/Omar-Abd-Elwahab/VariantTransformer.
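The abstract attributes the framework's behaviour to self-attention over variant features. As a generic illustration of that mechanism, and not of VariantTransformer's actual architecture, the sketch below implements scaled dot-product self-attention in plain Python. It assumes each variant feature has already been embedded as a small vector. The learned query, key, and value projections of a real Transformer layer are omitted, and the embedding values are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    A real Transformer layer first applies learned linear projections to
    obtain Q, K, and V; they are left out to keep the sketch short.
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    # Each output is a weighted average of all token vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical 2-dimensional embeddings for three variant features
# (for example call quality, read depth, allele balance).
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
outputs, weights = self_attention(features)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the input feature vectors. This is how one feature's representation comes to depend on the others.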



Citations: 0

A novel and accelerated method for integrated alignment and variant calling from short and long reads.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics

Pub Date : 2026-01-05 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1691056

Jinnan Hu, Donald Freed, Hanying Feng, Hong Chen, Zhipan Li, Haodong Chen

Background: Integrating short- and long-read sequencing technologies has become a promising approach for achieving accurate and comprehensive genomic analysis. Although short-read sequencing (Illumina, etc.) offers high base accuracy and cost efficiency, it struggles with structural variant (SV) detection and complex genomic regions. In contrast, long-read sequencing (PacBio HiFi) excels in resolving large SVs and repetitive sequences but is limited by throughput, higher insertion or deletion (indel) error rates, and sequencing costs. Hybrid approaches may combine these technologies and leverage their complementary strengths and different sources of error to provide higher accuracy, more comprehensive results, and higher throughput by lowering the coverage requirement for the long reads.

Methods: This study benchmarks the DNAscope Hybrid (DS-Hybrid) pipeline, a novel integrated alignment and variant calling framework that combines short- and long-read data sequenced from the same sample. The pipeline runs on generic x86 CPUs. We evaluate its performance across multiple human genome reference datasets (HG002-HG004) using the draft Q100 and Genome in a Bottle v4.2.1 benchmarks. The pipeline's ability to detect small variants [single-nucleotide polymorphisms (SNPs) and indels], SVs, and copy-number variations (CNVs) is assessed using data from the Illumina and PacBio sequencing systems at varying read depths (5×-30×). Benchmark results are compared to those of DeepVariant.

Results: The DNAscope Hybrid pipeline significantly improves SNP and indel calling accuracy, particularly in complex genomic regions. At lower long-read depths (e.g., 5×-10×), the hybrid approach outperforms stand-alone short- or long-read pipelines run at full sequencing depths (30×-35×), reducing variant calling errors by at least 50%. Additionally, DNAscope Hybrid outperforms leading open-source tools for SV and CNV detection and enhances variant discovery in challenging genomic regions. The pipeline also demonstrates clinical utility by identifying variants in disease-associated genes. Moreover, DNAscope Hybrid is highly efficient, achieving runtimes of less than 90 min on a single standard instance.

Conclusion: The DNAscope Hybrid pipeline is a computationally efficient, highly accurate variant calling framework that leverages the advantages of both short- and long-read sequencing. By improving variant detection in challenging genomic regions and offering a robust solution for clinical and large-scale genomic applications, it holds significant promise for genetic disease diagnostics, population-scale studies, and personalized medicine.



Citations: 0