Kai Shi, Qiaohui Liu, Qingrong Ji, Qisheng He, Xing-Ming Zhao
The gut microbiota plays a vital role in human health, and significant effort has been made to predict human phenotypes, especially diseases, using the microbiota as an indicator or predictor with machine learning (ML) methods. However, many factors reduce accuracy when predicting host phenotypes from metagenomic data, e.g. small sample size, class imbalance, and high-dimensional features. To address these challenges, we propose MicroHDF, an interpretable deep learning framework for predicting host phenotypes, in which cascaded layers of deep forest units are designed to handle sample class imbalance and high-dimensional features. The experimental results show that the performance of MicroHDF is competitive with that of existing state-of-the-art methods on 13 publicly available datasets covering six different diseases. In particular, it performs best for inflammatory bowel disease (IBD) and liver cirrhosis, with areas under the receiver operating characteristic curve of 0.9182 ± 0.0098 and 0.9469 ± 0.0076, respectively. MicroHDF also shows better performance and robustness in cross-study validation. Furthermore, MicroHDF is applied to two high-risk diseases, IBD and autism spectrum disorder, as case studies to identify potential biomarkers. In conclusion, our method provides effective and reliable prediction of the host phenotype and discovers informative features with biological insights.
{"title":"MicroHDF: predicting host phenotypes with metagenomic data using a deep forest-based framework.","authors":"Kai Shi, Qiaohui Liu, Qingrong Ji, Qisheng He, Xing-Ming Zhao","doi":"10.1093/bib/bbae530","DOIUrl":"10.1093/bib/bbae530","url":null,"abstract":"<p><p>The gut microbiota plays a vital role in human health, and significant effort has been made to predict human phenotypes, especially diseases, with the microbiota as a promising indicator or predictor with machine learning (ML) methods. However, the accuracy is impacted by a lot of factors when predicting host phenotypes with the metagenomic data, e.g. small sample size, class imbalance, high-dimensional features, etc. To address these challenges, we propose MicroHDF, an interpretable deep learning framework to predict host phenotypes, where a cascade layers of deep forest units is designed for handling sample class imbalance and high dimensional features. The experimental results show that the performance of MicroHDF is competitive with that of existing state-of-the-art methods on 13 publicly available datasets of six different diseases. In particular, it performs best with the area under the receiver operating characteristic curve of 0.9182 ± 0.0098 and 0.9469 ± 0.0076 for inflammatory bowel disease (IBD) and liver cirrhosis, respectively. Our MicroHDF also shows better performance and robustness in cross-study validation. Furthermore, MicroHDF is applied to two high-risk diseases, IBD and autism spectrum disorder, as case studies to identify potential biomarkers. 
In conclusion, our method provides an effective and reliable prediction of the host phenotype and discovers informative features with biological insights.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11500453/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142516299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A key advantage of single-cell multimodal joint profiling is the modality interplay, which is essential for deciphering cell fate. However, while current analytical methods can leverage the additive benefits, they fall short of exploring the synergistic insights of joint profiling, thereby diminishing its advantage. Here, we introduce CellMATE, a Multi-head Adversarial Training-based Early-integration approach specifically developed for multimodal joint profiling. CellMATE can capture both the additive and synergistic benefits inherent in joint profiling through auto-learning of multimodal distributions, and it simultaneously represents all features in a unified latent space. Through extensive evaluation across diverse joint profiling scenarios, CellMATE demonstrated its superiority in ensuring the utility of cross-modal properties, uncovering cellular heterogeneity and plasticity, and delineating differentiation trajectories. CellMATE uniquely unlocks the full potential of joint profiling to elucidate the dynamic nature of cells during critical processes such as differentiation, development, and disease.
{"title":"Unlocking cross-modal interplay of single-cell joint profiling with CellMATE.","authors":"Qi Wang, Bolei Zhang, Yue Guo, Luyu Gong, Erguang Li, Jingping Yang","doi":"10.1093/bib/bbae582","DOIUrl":"https://doi.org/10.1093/bib/bbae582","url":null,"abstract":"<p><p>A key advantage of single-cell multimodal joint profiling is the modality interplay, which is essential for deciphering the cell fate. However, while current analytical methods can leverage the additive benefits, they fall short to explore the synergistic insights of joint profiling, thereby diminishing the advantage of joint profiling. Here, we introduce CellMATE, a Multi-head Adversarial Training-based Early-integration approach specifically developed for multimodal joint profiling. CellMATE can capture both additive and synergistic benefits inherent in joint profiling through auto-learning of multimodal distributions and simultaneously represents all features into a unified latent space. Through extensive evaluation across diverse joint profiling scenarios, CellMATE demonstrated its superiority in ensuring utility of cross-modal properties, uncovering cellular heterogeneity and plasticity, and delineating differentiation trajectories. CellMATE uniquely unlocks the full potential of joint profiling to elucidate the dynamic nature of cells during critical processes as differentiation, development, and diseases.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142614952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xueping Zhou, Manqi Cai, Molin Yue, Juan C Celedón, Jiebiao Wang, Ying Ding, Wei Chen, Yanming Li
We propose a supervised learning bioinformatics tool, Biological gRoup guIded muLtivariate muLtiple lIneAr regression with peNalizaTion (Brilliant), designed for feature selection and outcome prediction in genomic data with multi-phenotypic responses. Brilliant specifically incorporates genome and/or phenotype grouping structures, as well as phenotype correlation structures, in feature selection, effect estimation, and outcome prediction under a penalized multi-response linear regression model. Extensive simulations demonstrate its superior performance compared to competing methods. We applied Brilliant to two omics studies. In the first study, we identified novel association signals between multivariate gene expressions and high-dimensional DNA methylation profiles, providing biological insights for the baseline CpG-to-gene regulation patterns in a Puerto Rican children asthma cohort. The second study focused on cell-type deconvolution prediction using high-dimensional gene expression profiles. Using Brilliant, we improved the accuracy for cell-type fraction prediction and identified novel cell-type signature genes.
{"title":"Molecular group and correlation guided structural learning for multi-phenotype prediction.","authors":"Xueping Zhou, Manqi Cai, Molin Yue, Juan C Celedón, Jiebiao Wang, Ying Ding, Wei Chen, Yanming Li","doi":"10.1093/bib/bbae585","DOIUrl":"10.1093/bib/bbae585","url":null,"abstract":"<p><p>We propose a supervised learning bioinformatics tool, Biological gRoup guIded muLtivariate muLtiple lIneAr regression with peNalizaTion (Brilliant), designed for feature selection and outcome prediction in genomic data with multi-phenotypic responses. Brilliant specifically incorporates genome and/or phenotype grouping structures, as well as phenotype correlation structures, in feature selection, effect estimation, and outcome prediction under a penalized multi-response linear regression model. Extensive simulations demonstrate its superior performance compared to competing methods. We applied Brilliant to two omics studies. In the first study, we identified novel association signals between multivariate gene expressions and high-dimensional DNA methylation profiles, providing biological insights for the baseline CpG-to-gene regulation patterns in a Puerto Rican children asthma cohort. The second study focused on cell-type deconvolution prediction using high-dimensional gene expression profiles. 
Using Brilliant, we improved the accuracy for cell-type fraction prediction and identified novel cell-type signature genes.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562839/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142614917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Limuxuan He, Quan Zou, Qi Dai, Shuang Cheng, Yansu Wang
Background: Microorganisms inhabit various regions of the human body and significantly contribute to numerous diseases. Predicting the associations between microbes and diseases is crucial for understanding pathogenic mechanisms and informing prevention and treatment strategies. Biological experiments to determine these associations are time-consuming and costly. Therefore, integrating deep learning with biological networks can efficiently identify potential microbe-disease associations on a large scale.
Methods: We propose an adversarial regularized autoencoder graph neural network algorithm, named Stacked Adversarial Regularization for Microbe-Disease Associations Prediction (SARMDA), for predicting associations between microbes and diseases. First, we integrate topological structural similarity and functional similarity metrics of microbes and diseases to construct a heterogeneous network. Then, utilizing an autoencoder based on GraphSAGE, we learn both the topological and attribute representations of nodes within the constructed network. Finally, we introduce an adversarial regularized autoencoder graph neural network embedding model to address the inherent limitations of traditional GraphSAGE autoencoders in capturing global information.
Results: Under five-fold cross-validation on microbe-disease pairs, SARMDA was compared with eight advanced methods using the Human Microbe-Disease Association Database (HMDAD) and Disbiome databases. The best area under the ROC curve (AUC) achieved by SARMDA on HMDAD was 0.9891 ± 0.0057, and the best area under the precision-recall curve (AUPR) was 0.9902 ± 0.0128. On the Disbiome dataset, the best AUC was 0.9328 ± 0.0072, and the best AUPR was 0.9233 ± 0.0089, outperforming the other eight microbe-disease association (MDA) prediction methods. Furthermore, the effectiveness of our model was demonstrated through a detailed analysis of asthma and inflammatory bowel disease cases.
{"title":"Adversarial regularized autoencoder graph neural network for microbe-disease associations prediction.","authors":"Limuxuan He, Quan Zou, Qi Dai, Shuang Cheng, Yansu Wang","doi":"10.1093/bib/bbae584","DOIUrl":"10.1093/bib/bbae584","url":null,"abstract":"<p><strong>Background: </strong>Microorganisms inhabit various regions of the human body and significantly contribute to numerous diseases. Predicting the associations between microbes and diseases is crucial for understanding pathogenic mechanisms and informing prevention and treatment strategies. Biological experiments to determine these associations are time-consuming and costly. Therefore, integrating deep learning with biological networks can efficiently identify potential microbe-disease associations on a large scale.</p><p><strong>Methods: </strong>We propose an adversarial regularized autoencoder graph neural network algorithm, named Stacked Adversarial Regularization for Microbe-Disease Associations Prediction (SARMDA), for predicting associations between microbes and diseases. First, we integrate topological structural similarity and functional similarity metrics of microbes and diseases to construct a heterogeneous network. Then, utilizing an autoencoder based on GraphSAGE, we learn both the topological and attribute representations of nodes within the constructed network. Finally, we introduce an adversarial regularized autoencoder graph neural network embedding model to address the inherent limitations of traditional GraphSAGE autoencoders in capturing global information.</p><p><strong>Results: </strong>Under the five-fold cross-validation on microbe-disease pairs, SARMDA was compared with eight advanced methods using the Human Microbe-Disease Association Database (HMDAD) and Disbiome databases. The best area under the ROC curve (AUC) achieved by SARMDA on HMDAD was 0.9891 ± 0.0057, and the best area under the precision-recall curve (AUPR) was 0.9902 ± 0.0128. 
On the Disbiome dataset, the AUC was 0.9328 ± 0.0072, and the best AUPR was 0.9233 ± 0.0089, outperforming the other eight MDAs prediction methods. Furthermore, the effectiveness of our model was demonstrated through a detailed analysis of asthma and inflammatory bowel disease cases.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11554402/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142614872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
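The SARMDA results above are reported as AUC values with a spread over five-fold cross-validation. As a rough illustration of how such a "mean ± deviation" figure can be produced, the sketch below computes AUC as the fraction of positive/negative pairs ranked correctly and then summarizes per-fold values. The function names `roc_auc` and `summarize_folds` are hypothetical, and the use of the sample standard deviation is an assumption: the abstract does not say which convention the authors used for the ± term.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_aucs):
    """Return (mean, sample standard deviation) of per-fold AUC values.

    Assumption: the '±' term is a sample standard deviation across folds.
    """
    return mean(fold_aucs), stdev(fold_aucs)


if __name__ == "__main__":
    # Illustrative per-fold values only; these are not results from the paper.
    m, s = summarize_folds([0.90, 0.92, 0.94, 0.96, 0.98])
    print(f"AUC = {m:.4f} ± {s:.4f}")
```

The pairwise formulation is quadratic in the number of examples, which is fine for a small illustration; a rank-based computation would be the usual choice for large association matrices.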
Mingjun Ji, Qing Yu, Xin-Zhuang Yang, Xianhong Yu, Jiaxin Wang, Chunfu Xiao, Ni A An, Chuanhui Han, Chuan-Yun Li, Wanqiu Ding
Recent advances in neoantigen research have accelerated the development of immunotherapies for cancers, such as glioblastoma (GBM). Neoantigens resulting from genomic mutations and dysregulated alternative splicing have been studied in GBM. However, these studies have primarily focused on annotated alternatively-spliced transcripts, leaving non-annotated transcripts largely unexplored. Circular ribonucleic acids (circRNAs), abnormally regulated in tumors, are correlated with the presence of non-annotated linear transcripts with exon skipping events. However, the extent to which these linear transcripts truly exist and their functions in cancer immunotherapies remain unknown. Here, we found the ubiquitous co-occurrence of circRNA biogenesis and alternative splicing across various tumor types, resulting in large amounts of long-range alternatively-spliced transcripts (LRs). By comparing tumor and healthy tissues, we identified tumor-specific LRs that are more abundant in GBM than in normal tissues and other tumor types. This may be attributable to the upregulation of the protein quaking in GBM, which is reported to promote circRNA biogenesis. In total, we identified 1057 specific and recurrent LRs in GBM. Through in silico translation prediction and MS-based immunopeptidome analysis, 16 major histocompatibility complex class I-associated peptides were identified as potential immunotherapy targets in GBM. This study revealed that long-range alternatively-spliced transcripts specifically upregulated in GBM may serve as recurrent, immunogenic tumor-specific antigens.
{"title":"Long-range alternative splicing contributes to neoantigen specificity in glioblastoma.","authors":"Mingjun Ji, Qing Yu, Xin-Zhuang Yang, Xianhong Yu, Jiaxin Wang, Chunfu Xiao, Ni A An, Chuanhui Han, Chuan-Yun Li, Wanqiu Ding","doi":"10.1093/bib/bbae503","DOIUrl":"https://doi.org/10.1093/bib/bbae503","url":null,"abstract":"<p><p>Recent advances in neoantigen research have accelerated the development of immunotherapies for cancers, such as glioblastoma (GBM). Neoantigens resulting from genomic mutations and dysregulated alternative splicing have been studied in GBM. However, these studies have primarily focused on annotated alternatively-spliced transcripts, leaving non-annotated transcripts largely unexplored. Circular ribonucleic acids (circRNAs), abnormally regulated in tumors, are correlated with the presence of non-annotated linear transcripts with exon skipping events. But the extent to which these linear transcripts truly exist and their functions in cancer immunotherapies remain unknown. Here, we found the ubiquitous co-occurrence of circRNA biogenesis and alternative splicing across various tumor types, resulting in large amounts of long-range alternatively-spliced transcripts (LRs). By comparing tumor and healthy tissues, we identified tumor-specific LRs more abundant in GBM than in normal tissues and other tumor types. This may be attributable to the upregulation of the protein quaking in GBM, which is reported to promote circRNA biogenesis. In total, we identified 1057 specific and recurrent LRs in GBM. Through in silico translation prediction and MS-based immunopeptidome analysis, 16 major histocompatibility complex class I-associated peptides were identified as potential immunotherapy targets in GBM. 
This study revealed long-range alternatively-spliced transcripts specifically upregulated in GBM may serve as recurrent, immunogenic tumor-specific antigens.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11472750/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jessica Butts, Leif Verace, Christine Wendt, Russel P Bowler, Craig P Hersh, Qi Long, Lynn Eberly, Sandra E Safo
Epidemiologic and genetic studies in many complex diseases suggest subgroup disparities (e.g. by sex, race) in disease course and patient outcomes. We consider this from the standpoint of integrative analysis where we combine information from different views (e.g. genomics, proteomics, clinical data). Existing integrative analysis methods ignore the heterogeneity in subgroups, and stacking the views and accounting for subgroup heterogeneity does not model the association among the views. We propose Heterogeneity in Integration and Prediction (HIP), a statistical approach for joint association and prediction that leverages the strengths in each view to identify molecular signatures that are shared by and specific to a subgroup. We apply HIP to proteomics and gene expression data pertaining to chronic obstructive pulmonary disease (COPD) to identify proteins and genes shared by, and unique to, males and females, contributing to the variation in COPD, measured by airway wall thickness. Our COPD findings have identified proteins, genes, and pathways that are common across and specific to males and females, some implicated in COPD, while others could lead to new insights into sex differences in COPD mechanisms. HIP accounts for subgroup heterogeneity in multi-view data, ranks variables based on importance, is applicable to univariate or multivariate continuous outcomes, and incorporates covariate adjustment. With the efficient algorithms implemented using PyTorch, this method has many potential scientific applications and could enhance multiomics research in health disparities. HIP is available at https://github.com/lasandrall/HIP, a video tutorial at https://youtu.be/O6E2OLmeMDo and a Shiny Application at https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/ for users with limited programming experience.
{"title":"HIP: a method for high-dimensional multi-view data integration and prediction accounting for subgroup heterogeneity.","authors":"Jessica Butts, Leif Verace, Christine Wendt, Russel P Bowler, Craig P Hersh, Qi Long, Lynn Eberly, Sandra E Safo","doi":"10.1093/bib/bbae470","DOIUrl":"10.1093/bib/bbae470","url":null,"abstract":"<p><p>Epidemiologic and genetic studies in many complex diseases suggest subgroup disparities (e.g. by sex, race) in disease course and patient outcomes. We consider this from the standpoint of integrative analysis where we combine information from different views (e.g. genomics, proteomics, clinical data). Existing integrative analysis methods ignore the heterogeneity in subgroups, and stacking the views and accounting for subgroup heterogeneity does not model the association among the views. We propose Heterogeneity in Integration and Prediction (HIP), a statistical approach for joint association and prediction that leverages the strengths in each view to identify molecular signatures that are shared by and specific to a subgroup. We apply HIP to proteomics and gene expression data pertaining to chronic obstructive pulmonary disease (COPD) to identify proteins and genes shared by, and unique to, males and females, contributing to the variation in COPD, measured by airway wall thickness. Our COPD findings have identified proteins, genes, and pathways that are common across and specific to males and females, some implicated in COPD, while others could lead to new insights into sex differences in COPD mechanisms. HIP accounts for subgroup heterogeneity in multi-view data, ranks variables based on importance, is applicable to univariate or multivariate continuous outcomes, and incorporates covariate adjustment. With the efficient algorithms implemented using PyTorch, this method has many potential scientific applications and could enhance multiomics research in health disparities. 
HIP is available at https://github.com/lasandrall/HIP, a video tutorial at https://youtu.be/O6E2OLmeMDo and a Shiny Application at https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/ for users with limited programming experience.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11440091/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Miao Cui, Yadong Liu, Xian Yu, Hongzhe Guo, Tao Jiang, Yadong Wang, Bo Liu
Nanopore sequencing technology offers longer read lengths and can potentially address the limitations of short-read sequencing in long-range haplotype phasing and accurate variant calling. However, state-of-the-art approaches still leave room for improvement in single nucleotide variant (SNV) identification performance and computing resource usage. In this work, we introduce miniSNV, a lightweight SNV calling algorithm that simultaneously achieves high performance and yield. miniSNV utilizes known common variants in populations as variation backgrounds and leverages read pileup, read-based phasing, and consensus generation to identify and genotype SNVs for Oxford Nanopore Technologies (ONT) long reads. Benchmarks on real and simulated ONT data under various error profiles demonstrate that miniSNV has superior sensitivity and comparable accuracy in SNV detection, and that it runs faster, with outstanding scalability and lower memory usage, than most state-of-the-art variant callers. miniSNV is available from https://github.com/CuiMiao-HIT/miniSNV.
{"title":"miniSNV: accurate and fast single nucleotide variant calling from nanopore sequencing data.","authors":"Miao Cui, Yadong Liu, Xian Yu, Hongzhe Guo, Tao Jiang, Yadong Wang, Bo Liu","doi":"10.1093/bib/bbae473","DOIUrl":"https://doi.org/10.1093/bib/bbae473","url":null,"abstract":"<p><p>Nanopore sequence technology has demonstrated a longer read length and enabled to potentially address the limitations of short-read sequencing including long-range haplotype phasing and accurate variant calling. However, there is still room for improvement in terms of the performance of single nucleotide variant (SNV) identification and computing resource usage for the state-of-the-art approaches. In this work, we introduce miniSNV, a lightweight SNV calling algorithm that simultaneously achieves high performance and yield. miniSNV utilizes known common variants in populations as variation backgrounds and leverages read pileup, read-based phasing, and consensus generation to identify and genotype SNVs for Oxford Nanopore Technologies (ONT) long reads. Benchmarks on real and simulated ONT data under various error profiles demonstrate that miniSNV has superior sensitivity and comparable accuracy on SNV detection and runs faster with outstanding scalability and lower memory than most state-of-the-art variant callers. 
miniSNV is available from https://github.com/CuiMiao-HIT/miniSNV.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11428505/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
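The miniSNV abstract names read pileup as one of its building blocks. The sketch below illustrates only the general idea of a pileup-based SNV call, and it is not miniSNV's algorithm: it assumes gapless alignments given as hypothetical `(start, sequence)` pairs, ignores base qualities, and performs no phasing or consensus generation. The thresholds `min_depth` and `min_alt_fraction` are illustrative choices, not values from the paper.

```python
from collections import Counter


def pileup(reads, ref_len):
    """Count observed bases at each reference position (gapless alignments only)."""
    columns = [Counter() for _ in range(ref_len)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_len:
                columns[pos][base] += 1
    return columns


def call_snvs(reference, reads, min_depth=3, min_alt_fraction=0.3):
    """Report positions where a non-reference base is frequent enough.

    Returns tuples of (position, ref_base, alt_base, alt_count, depth).
    """
    calls = []
    for pos, counts in enumerate(pileup(reads, len(reference))):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call at this position
        alt_counts = {b: c for b, c in counts.items() if b != reference[pos]}
        if not alt_counts:
            continue  # every read agrees with the reference
        alt, alt_count = max(alt_counts.items(), key=lambda kv: kv[1])
        if alt_count / depth >= min_alt_fraction:
            calls.append((pos, reference[pos], alt, alt_count, depth))
    return calls


if __name__ == "__main__":
    reference = "ACGTACGT"
    reads = [(0, "ACGTAC"), (2, "GTTCGT"), (1, "CGTTCG"), (0, "ACGTTCGT")]
    for call in call_snvs(reference, reads):
        print(call)
```

A fixed allele-fraction threshold is a deliberately crude stand-in here; with the higher error rates of long reads, real callers rely on additional evidence such as the phasing and population-variant background the abstract mentions.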
Yetong Zhou, Shengming Zhou, Yue Bi, Quan Zou, Cangzhi Jia
Liquid-liquid phase separation (LLPS) is one of the mechanisms mediating the compartmentalization of macromolecules (proteins and nucleic acids) in cells, forming biomolecular condensates or membraneless organelles. Consequently, the systematic identification of potential LLPS proteins is crucial for understanding the phase separation process and its biological mechanisms. A two-task predictor, Opt_PredLLPS, was developed to discover potential phase separation proteins and further evaluate their mechanism. The first task model of Opt_PredLLPS combines a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) through a fully connected layer, where the CNN utilizes evolutionary information features as input, and BiLSTM utilizes multimodal features as input. If a protein is predicted to be an LLPS protein, it is input into the second task model to predict whether this protein needs to interact with its partners to undergo LLPS. The second task model employs the XGBoost classification algorithm and 37 physicochemical properties following a three-step feature selection. The effectiveness of the model was validated on multiple benchmark datasets, and in silico saturation mutagenesis was used to identify regions that play a key role in phase separation. These findings may assist future research on the LLPS mechanism and the discovery of potential phase separation proteins.
{"title":"A two-task predictor for discovering phase separation proteins and their undergoing mechanism.","authors":"Yetong Zhou, Shengming Zhou, Yue Bi, Quan Zou, Cangzhi Jia","doi":"10.1093/bib/bbae528","DOIUrl":"10.1093/bib/bbae528","url":null,"abstract":"<p><p>Liquid-liquid phase separation (LLPS) is one of the mechanisms mediating the compartmentalization of macromolecules (proteins and nucleic acids) in cells, forming biomolecular condensates or membraneless organelles. Consequently, the systematic identification of potential LLPS proteins is crucial for understanding the phase separation process and its biological mechanisms. A two-task predictor, Opt_PredLLPS, was developed to discover potential phase separation proteins and further evaluate their mechanism. The first task model of Opt_PredLLPS combines a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) through a fully connected layer, where the CNN utilizes evolutionary information features as input, and BiLSTM utilizes multimodal features as input. If a protein is predicted to be an LLPS protein, it is input into the second task model to predict whether this protein needs to interact with its partners to undergo LLPS. The second task model employs the XGBoost classification algorithm and 37 physicochemical properties following a three-step feature selection. The effectiveness of the model was validated on multiple benchmark datasets, and in silico saturation mutagenesis was used to identify regions that play a key role in phase separation. 
These findings may assist future research on the LLPS mechanism and the discovery of potential phase separation proteins.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11492799/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
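The two-task design described for Opt_PredLLPS is a gated cascade: the second model is consulted only for proteins that the first model predicts to undergo LLPS. A minimal sketch of that control flow follows. The scorer callables stand in for the CNN/BiLSTM and XGBoost models, and the function name, label strings, and 0.5 threshold are all hypothetical, chosen for illustration rather than taken from the paper.

```python
from typing import Callable, Optional, Tuple


def two_task_predict(
    protein: str,
    llps_score: Callable[[str], float],
    partner_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Tuple[str, Optional[str]]:
    """Run task 1, and run task 2 only if task 1 predicts an LLPS protein.

    llps_score:    stand-in for the first-task model (probability of LLPS).
    partner_score: stand-in for the second-task model (probability that the
                   protein needs interaction partners to undergo LLPS).
    """
    if llps_score(protein) < threshold:
        # Task 2 is skipped entirely for predicted non-LLPS proteins.
        return ("non-LLPS", None)
    needs_partner = partner_score(protein) >= threshold
    return ("LLPS", "partner-dependent" if needs_partner else "partner-independent")


if __name__ == "__main__":
    # Dummy scorers for demonstration; real models would take sequence features.
    print(two_task_predict("MKVLAAGIV", lambda s: 0.8, lambda s: 0.3))
```

Keeping the gate in one place makes it explicit that second-task predictions are conditional on the first task, so errors in task 1 propagate to task 2.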
Zhen Wang, Ziqi Liu, Wei Zhang, Yanjun Li, Yizhen Feng, Shaokang Lv, Han Diao, Zhaofeng Luo, Pengju Yan, Min He, Xiaolin Li
Aptamers are single-stranded nucleic acid ligands featuring high affinity and specificity to target molecules. Traditionally, they are identified from large DNA/RNA libraries using in vitro methods such as Systematic Evolution of Ligands by Exponential Enrichment (SELEX). However, these libraries capture only a small fraction of the theoretical sequence space, and the range of aptamer candidates is constrained by the actual sequencing capabilities of the experiment. To address this, we propose AptaDiff, the first in silico aptamer design and optimization method based on the diffusion model. AptaDiff can generate aptamers beyond the constraints of high-throughput sequencing data by leveraging motif-dependent latent embeddings from a variational autoencoder, and it can optimize aptamers through affinity-guided generation based on Bayesian optimization. Comparative evaluations revealed AptaDiff's superiority over existing aptamer generation methods in terms of quality and fidelity across four high-throughput screening datasets targeting distinct proteins. Moreover, surface plasmon resonance experiments were conducted to validate the binding affinity of aptamers generated through Bayesian optimization for two target proteins. The results showed increases of 87.9% and 60.2% in RU values, along with 3.6-fold and 2.4-fold decreases in KD values, for the respective target proteins. Notably, the optimized aptamers demonstrated superior binding affinity compared to top experimental candidates selected through SELEX, underscoring the promise of AptaDiff in accelerating the discovery of superior aptamers.
{"title":"AptaDiff: de novo design and optimization of aptamers based on diffusion models.","authors":"Zhen Wang, Ziqi Liu, Wei Zhang, Yanjun Li, Yizhen Feng, Shaokang Lv, Han Diao, Zhaofeng Luo, Pengju Yan, Min He, Xiaolin Li","doi":"10.1093/bib/bbae517","DOIUrl":"10.1093/bib/bbae517","url":null,"abstract":"<p><p>Aptamers are single-stranded nucleic acid ligands, featuring high affinity and specificity to target molecules. Traditionally they are identified from large DNA/RNA libraries using $in vitro$ methods, like Systematic Evolution of Ligands by Exponential Enrichment (SELEX). However, these libraries capture only a small fraction of theoretical sequence space, and various aptamer candidates are constrained by actual sequencing capabilities from the experiment. Addressing this, we proposed AptaDiff, the first in silico aptamer design and optimization method based on the diffusion model. Our Aptadiff can generate aptamers beyond the constraints of high-throughput sequencing data, leveraging motif-dependent latent embeddings from variational autoencoder, and can optimize aptamers by affinity-guided aptamer generation according to Bayesian optimization. Comparative evaluations revealed AptaDiff's superiority over existing aptamer generation methods in terms of quality and fidelity across four high-throughput screening data targeting distinct proteins. Moreover, surface plasmon resonance experiments were conducted to validate the binding affinity of aptamers generated through Bayesian optimization for two target proteins. The results unveiled a significant boost of $87.9%$ and $60.2%$ in RU values, along with a 3.6-fold and 2.4-fold decrease in KD values for the respective target proteins. 
Notably, the optimized aptamers demonstrated superior binding affinity compared to top experimental candidates selected through SELEX, underscoring the promising outcomes of our AptaDiff in accelerating the discovery of superior aptamers.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11491854/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quang-Huy Nguyen, Ha Nguyen, Edwin C Oh, Tin Nguyen
Metabolite profiling is a powerful approach for the clinical diagnosis of complex diseases, ranging from cardiometabolic diseases, cancer, and cognitive disorders to respiratory pathologies and conditions that involve dysregulated metabolism. Because of the importance of systems-level interpretation, many methods have been developed to identify biologically significant pathways using metabolomics data. In this review, we first describe a complete metabolomics workflow (sample preparation, data acquisition, pre-processing, downstream analysis, etc.). We then comprehensively review 24 approaches capable of performing functional analysis, including those that combine metabolomics data with other types of data to investigate disease-relevant changes at multiple omics layers. We discuss their availability, implementation, capability for pre-processing and quality control, supported omics types, embedded databases, pathway analysis methodologies, and integration techniques. We also rate and evaluate each software package, focusing on its key technique, accessibility, documentation, and user-friendliness. Following our guidelines, life scientists can choose a suitable method based on method rating, available data, input format, and method category. More importantly, we highlight outstanding challenges and potential solutions that need to be addressed by future research. To further assist users in running the reviewed methods, we provide wrappers for the software packages at https://github.com/tinnlab/metabolite-pathway-review-docker.
{"title":"Current approaches and outstanding challenges of functional annotation of metabolites: a comprehensive review.","authors":"Quang-Huy Nguyen, Ha Nguyen, Edwin C Oh, Tin Nguyen","doi":"10.1093/bib/bbae498","DOIUrl":"https://doi.org/10.1093/bib/bbae498","url":null,"abstract":"<p><p>Metabolite profiling is a powerful approach for the clinical diagnosis of complex diseases, ranging from cardiometabolic diseases, cancer, and cognitive disorders to respiratory pathologies and conditions that involve dysregulated metabolism. Because of the importance of systems-level interpretation, many methods have been developed to identify biologically significant pathways using metabolomics data. In this review, we first describe a complete metabolomics workflow (sample preparation, data acquisition, pre-processing, downstream analysis, etc.). We then comprehensively review 24 approaches capable of performing functional analysis, including those that combine metabolomics data with other types of data to investigate the disease-relevant changes at multiple omics layers. We discuss their availability, implementation, capability for pre-processing and quality control, supported omics types, embedded databases, pathway analysis methodologies, and integration techniques. We also provide a rating and evaluation of each software, focusing on their key technique, software accessibility, documentation, and user-friendliness. Following our guideline, life scientists can easily choose a suitable method depending on method rating, available data, input format, and method category. More importantly, we highlight outstanding challenges and potential solutions that need to be addressed by future research. 
To further assist users in executing the reviewed methods, we provide wrappers of the software packages at https://github.com/tinnlab/metabolite-pathway-review-docker.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471905/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}