Modular response analysis (MRA) is an effective method for inferring biological networks from perturbation data. However, it has several limitations, such as strong sensitivity to noise, the need for independent perturbations that each hit a single node, and a linear approximation of dependencies within the network. Previously, we addressed the sensitivity of MRA to noise by reinterpreting MRA as a multilinear regression problem. We demonstrated the advantages of this approach over conventional MRA and other known inference methods, particularly in handling noisy measurements and nonlinear networks. Here, we provide new contributions that complement this theory. First, we remove the requirement that perturbations be independent, thereby broadening the applicability of MRA. Second, using analysis of variance and lack-of-fit tests, we can now assess the compatibility of MRA with the data and identify the primary source of errors. In cases where nonlinearity prevails, we propose extending the model to a second-order polynomial. Third, we demonstrate how to use prior knowledge about a network effectively. We validated these results using 4 networks with known dynamics (3, 4, and 6 nodes) and 40 simulated networks ranging from 10 to 200 nodes. Finally, we incorporated these innovations into our R software package MRARegress to offer a comprehensive, extended theory for MRA and to facilitate its use by the community. Mathematical aspects, test details, and scripts are provided as Supplementary Information (see 'Data Availability Statement').
Jean-Pierre Borg, Jacques Colinge, Patrice Ravel. Testing and overcoming the limitations of modular response analysis. Briefings in bioinformatics 26(2), 2025-03-04. doi:10.1093/bib/bbaf098. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11891662/pdf/
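The regression view of MRA mentioned in the abstract above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the MRARegress implementation: the 3-node network `R_TRUE`, the perturbation strengths, and the helper names are all hypothetical, each perturbation hits a single node, and the data are noise-free. It uses the textbook MRA relation that, for any perturbation not hitting node i, the response of node i is a linear combination of the responses of the other nodes, so each row of the connectivity matrix can be estimated by ordinary least squares.

```python
# Toy illustration of MRA as a per-node regression (hypothetical data, not MRARegress code).

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy, inputs untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(rows, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return solve(xtx, xty)

# Hypothetical network: R_TRUE[i][j] is the local effect of node j on node i,
# with the conventional -1 on the diagonal.
R_TRUE = [[-1.0, 0.0, -0.6],
          [0.8, -1.0, 0.0],
          [0.0, 0.5, -1.0]]

# Each perturbation is (node hit, strength); two strengths per node (assumed values).
PERTURBATIONS = [(0, 0.1), (0, 0.2), (1, 0.1), (1, 0.2), (2, 0.1), (2, 0.2)]

def global_response(node, strength):
    """Linearized steady-state response x solving R_TRUE x = -strength * e_node."""
    rhs = [-strength if i == node else 0.0 for i in range(len(R_TRUE))]
    return solve(R_TRUE, rhs)

def infer_row(i, responses):
    """Regress x_i on the other nodes over perturbations that do not hit node i."""
    n = len(R_TRUE)
    others = [j for j in range(n) if j != i]
    rows, y = [], []
    for (node, _), x in zip(PERTURBATIONS, responses):
        if node != i:
            rows.append([x[j] for j in others])
            y.append(x[i])
    row = [-1.0] * n
    for j, c in zip(others, least_squares(rows, y)):
        row[j] = c
    return row

responses = [global_response(node, s) for node, s in PERTURBATIONS]
inferred = [infer_row(i, responses) for i in range(len(R_TRUE))]
for row in inferred:
    print([round(v, 3) for v in row])
```

With noisy measurements the same fit would return estimates rather than exact values, and the residuals of each per-node regression are the quantities that analysis-of-variance and lack-of-fit tests would examine.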
Aleix Boquet-Pujadas, Jian Zeng, Ye Ella Tian, Zhijian Yang, Li Shen, Andrew Zalesky, Christos Davatzikos, Junhao Wen
Artificial intelligence (AI) has been increasingly integrated into imaging genetics to provide intermediate phenotypes (i.e. endophenotypes) that bridge the genetics and clinical manifestations of human disease. However, the genetic architecture of these AI endophenotypes remains largely unexplored in the context of human multiorgan system diseases. Using publicly available genome-wide association study summary statistics from the UK Biobank (UKBB), FinnGen, and the Psychiatric Genomics Consortium, we comprehensively depicted the genetic architecture of 2024 multiorgan AI endophenotypes (MAEs). We comparatively assessed the single-nucleotide polymorphism-based heritability, polygenicity, and natural selection signatures of the 2024 MAEs using methods commonly used in the field. Genetic correlation and Mendelian randomization analyses revealed both within-organ relationships and cross-organ interconnections. Bi-directional causal relationships were established between chronic human diseases and MAEs across multiple organ systems, including Alzheimer's disease for the brain, diabetes for the metabolic system, asthma for the pulmonary system, and hypertension for the cardiovascular system. Finally, we derived polygenic risk scores for the 2024 MAEs for individuals not used to calculate the MAEs and returned these to the UKBB. Our findings underscore the promise of the MAEs as new instruments to improve overall human health. All results are encapsulated in the MUlTiorgan AI endophenoTypE (MUTATE) genetic atlas and are publicly available at https://labs-laboratory.com/mutate.
Aleix Boquet-Pujadas, Jian Zeng, Ye Ella Tian, Zhijian Yang, Li Shen, Andrew Zalesky, Christos Davatzikos, Junhao Wen. MUTATE: a human genetic atlas of multiorgan artificial intelligence endophenotypes using genome-wide association summary statistics. Briefings in bioinformatics 26(2), 2025-03-04. doi:10.1093/bib/bbaf125. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11938998/pdf/
Jair Herazo-Álvarez, Marco Mora, Sara Cuadros-Orellana, Karina Vilches-Ponce, Ruber Hernández-García
One of the main goals of metagenomic studies is to describe the taxonomic diversity of microbial communities. A crucial step in metagenomic analysis is metagenomic binning, which involves the (supervised) classification or (unsupervised) clustering of metagenomic sequences. Various machine learning models have been applied to this task. This review details the contributions of artificial neural networks (ANNs) to metagenomic binning, covering supervised, unsupervised, and semi-supervised approaches. Thirty-four ANN-based binning tools are systematically compared in terms of their architectures, input features, datasets, advantages, disadvantages, and other relevant aspects. The findings reveal that deep learning approaches, such as convolutional neural networks and autoencoders, achieve higher accuracy and scalability than traditional methods. Gaps in benchmarking practices are highlighted, and future directions are proposed, including standardized datasets and the optimization of architectures for third-generation sequencing. This review supports researchers in identifying trends and selecting suitable tools for the metagenomic binning problem.
Jair Herazo-Álvarez, Marco Mora, Sara Cuadros-Orellana, Karina Vilches-Ponce, Ruber Hernández-García. A review of neural networks for metagenomic binning. Briefings in bioinformatics 26(2), 2025-03-04. doi:10.1093/bib/bbaf065. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11934572/pdf/
Yahao Wu, Jing Liu, Yanni Xiao, Shuqin Zhang, Limin Li
With the rapid advances in single-cell sequencing technology, it is now feasible to conduct in-depth genetic analysis of individual cells. Studying the dynamics of single cells in response to perturbations is of great significance for understanding the functions and behaviors of living organisms. However, acquiring post-perturbation cellular states through biological experiments is frequently cost-prohibitive, so predicting single-cell perturbation responses is a critical challenge in computational biology. In this work, we propose a novel deep learning method called coupled variational autoencoders (CoupleVAE), devised to predict post-perturbation single-cell RNA-seq data. CoupleVAE is composed of two coupled VAEs connected by a coupler: two encoders first extract latent features for control and perturbed cells, the coupler then translates between the two within the latent space through two nonlinear mappings, and two separate decoders finally generate control and perturbed data from the encoded and translated features. CoupleVAE thus allows a more intricate state transformation of single cells within the latent space. Experiments on three real datasets covering infection, stimulation, and cross-species prediction show that CoupleVAE predicts single-cell RNA-seq data for perturbed cells more accurately than the existing models it was compared with.
Yahao Wu, Jing Liu, Yanni Xiao, Shuqin Zhang, Limin Li. CoupleVAE: coupled variational autoencoders for predicting perturbational single-cell RNA sequencing data. Briefings in bioinformatics 26(2), 2025-03-04. doi:10.1093/bib/bbaf126. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11966612/pdf/
The rapid advancement of next-generation sequencing (NGS) technology and the expanding availability of NGS datasets have led to a significant surge in biomedical research. NGS data analysis is crucial for better understanding the molecular processes underlying cancer and its development, and for supporting diagnosis, prediction, and therapy. However, high-dimensional NGS multi-layer omics datasets are highly complex. Some computational methods have recently been developed for cancer omics data interpretation, but many existing methods have difficulty accounting for diverse types of cancer omics data and struggle to extract informative features for the integrated identification of core units. To address these challenges, we propose a hybrid feature selection (HFS) technique to detect optimal features from multi-layer omics datasets. We then propose a novel hybrid deep recurrent neural network-based model, DOMSCNet, to classify stomach cancer. The proposed model was made generic for all four multi-layer omics datasets. To assess its robustness, the DOMSCNet model was validated with eight external datasets. Experimental results showed that the SelectKBest-maximum relevancy minimum redundancy-Boruta (SMB) HFS technique outperformed all other HFS techniques. Across the four multi-layer omics datasets and the validation datasets, the proposed DOMSCNet model outperformed existing classifiers as well as the other proposed classifiers.
Kasmika Borah, Himanish Shekhar Das, Ram Kaji Budhathoki, Khursheed Aurangzeb, Saurav Mallik. DOMSCNet: a deep learning model for the classification of stomach cancer using multi-layer omics data. Briefings in bioinformatics 26(2), 2025-03-04. doi:10.1093/bib/bbaf115. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11966610/pdf/
Wei Xu, Gang Luo, Weiyu Meng, Xiaobing Zhai, Keli Zheng, Ji Wu, Yanrong Li, Abao Xing, Junrong Li, Zhifan Li, Ke Zheng, Kefeng Li
Understanding causality in medical research is essential for developing effective interventions and diagnostic tools. Mendelian Randomization (MR) is a pivotal method for inferring causality through genetic data. However, MR analysis often requires pre-identification of exposure-outcome pairs from clinical experience or literature, which can be challenging to obtain. This poses difficulties for clinicians investigating causal factors of specific diseases. To address this, we introduce MRAgent, an innovative automated agent leveraging Large Language Models (LLMs) to enhance causal knowledge discovery in disease research. MRAgent autonomously scans scientific literature, discovers potential exposure-outcome pairs, and performs MR causal inference using extensive Genome-Wide Association Study data. We conducted both automated and human evaluations to compare different LLMs in operating MRAgent and provided a proof-of-concept case to demonstrate the complete workflow. MRAgent's capability to conduct large-scale causal analyses represents a significant advancement, equipping researchers and clinicians with a robust tool for exploring and validating causal relationships in complex diseases. Our code is public at https://github.com/xuwei1997/MRAgent.
Wei Xu, Gang Luo, Weiyu Meng, Xiaobing Zhai, Keli Zheng, Ji Wu, Yanrong Li, Abao Xing, Junrong Li, Zhifan Li, Ke Zheng, Kefeng Li. MRAgent: an LLM-based automated agent for causal knowledge discovery in disease via Mendelian randomization. Briefings in bioinformatics 26(2), 2025-03-04. https://doi.org/10.1093/bib/bbaf140
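To make the MR causal inference step mentioned in the abstract above concrete, here is a minimal sketch of one standard two-sample estimator, the fixed-effect inverse-variance weighted (IVW) estimate computed from GWAS summary statistics. It is an illustration under stated assumptions, not MRAgent's code: the abstract does not say which MR estimator is used, and the function name and the summary statistics below are hypothetical.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate and its standard error.

    Each list holds one value per genetic instrument (SNP):
    SNP-exposure effect, SNP-outcome effect, and the standard error
    of the SNP-outcome effect.
    """
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1.0 / denominator)

# Hypothetical summary statistics for four SNPs; the outcome effects are
# constructed as exactly half the exposure effects.
beta_exposure = [0.10, 0.20, 0.15, 0.05]
beta_outcome = [0.05, 0.10, 0.075, 0.025]
se_outcome = [0.01, 0.02, 0.015, 0.01]

estimate, std_error = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
print(f"IVW estimate: {estimate:.3f} (SE {std_error:.3f})")
```

Because every per-SNP ratio in this toy input equals 0.5, the weighted estimate is also 0.5; with real data the ratios differ and the weights favour the more precisely measured instruments.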
Identifying genes causally linked to cancer from a multi-omics perspective is essential for understanding the mechanisms of cancer and improving therapeutic strategies. Traditional statistical and machine-learning methods that rely on generalized correlation to identify cancer genes often produce redundant, biased predictions with limited interpretability, largely because they overlook confounding factors, selection biases, and the nonlinear activation functions in neural networks. In this study, we introduce a novel framework for identifying cancer genes across multiple omics domains, named ICGI (Integrative Causal Gene Identification), which combines a large language model (LLM) prompted with causal contextual cues and data-driven causal feature selection. This approach demonstrates the effectiveness and potential of LLMs in uncovering cancer genes and comprehending disease mechanisms, particularly at the genomic level. However, our findings also show that current LLMs may not capture comprehensive information across all omics levels. Applied to transcriptomic datasets from six cancer types in The Cancer Genome Atlas and compared with state-of-the-art methods, the proposed causal feature selection module demonstrates superior capability in identifying cancer genes that distinguish cancerous from normal samples. Additionally, we have developed an online service platform that allows users to input a gene of interest and a specific cancer type. The platform provides automated results indicating whether the gene plays a significant role in cancer, along with clear and accessible explanations. Moreover, the platform summarizes the inference outcomes obtained from data-driven causal learning methods.
Haolong Zeng, Chaoyi Yin, Chunyang Chai, Yuezhu Wang, Qi Dai, Huiyan Sun. Cancer gene identification through integrating causal prompting large language model with omics data-driven causal inference. Briefings in bioinformatics 26(2), 2025-03-04. doi:10.1093/bib/bbaf113. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899576/pdf/
Sizhe Qiu, Bozhen Hu, Jing Zhao, Weiren Xu, Aidong Yang
An accurate deep learning predictor is needed for enzyme optimal temperature ($T_{opt}$), which quantitatively describes how temperature affects enzyme catalytic activity. Compared with existing models, the new model developed in this study, Seq2Topt, achieved superior accuracy in $T_{opt}$ prediction using only protein sequences (RMSE = 12.26 °C and $R^2$ = 0.57), and could capture key protein regions for enzyme $T_{opt}$ with multi-head attention on residues. Case studies on thermophilic enzyme selection and on predicting enzyme $T_{opt}$ shifts caused by point mutations showed Seq2Topt to be a promising computational tool for enzyme mining and in-silico enzyme design. Additionally, accurate deep learning predictors of enzyme optimal pH (Seq2pHopt, RMSE = 0.88 and $R^2$ = 0.42) and melting temperature (Seq2Tm, RMSE = 7.57 °C and $R^2$ = 0.64) were developed based on the model architecture of Seq2Topt, suggesting that Seq2Topt could give rise to a useful prediction platform for enzymes.
Sizhe Qiu, Bozhen Hu, Jing Zhao, Weiren Xu, Aidong Yang. Seq2Topt: a sequence-based deep learning predictor of enzyme optimal temperature. Briefings in bioinformatics 26(2), 2025-03-04. doi:10.1093/bib/bbaf114. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11904407/pdf/
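The two accuracy figures quoted in the abstract above, RMSE and R2, can be computed as in the following minimal sketch. The temperatures are hypothetical and are not Seq2Topt results, and R2 is taken here as the coefficient of determination; the abstract does not state whether that definition or the squared Pearson correlation was used.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the measurements."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and predicted optimal temperatures in degrees Celsius.
measured = [30.0, 40.0, 50.0, 60.0, 70.0]
predicted = [35.0, 35.0, 55.0, 55.0, 75.0]

print(f"RMSE = {rmse(measured, predicted):.2f} °C")    # RMSE = 5.00 °C
print(f"R2   = {r_squared(measured, predicted):.3f}")  # R2   = 0.875
```

RMSE is reported in the units of the target, which is why the abstract gives it in °C for temperatures and without a unit for pH, whereas R2 is dimensionless.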
Nicolaas F V Burger, Vittorio F Nicolis, Anna-Maria Botha
Aphids are a speciose family of the Hemiptera comprising >5500 species. They have adapted to feed on multiple plant species and occur on every continent on Earth. Although aphids are economically devastating, very few aphid genomes have been sequenced and assembled, and those that have suffer from low contiguity because the genomes are repeat-rich and AT-rich. With third-generation sequencing becoming more affordable and approaching the quality of second-generation sequencing, producing more contiguous aphid genome assemblies is becoming a reality. As the list of available long-read assemblers grows, choosing an assembly tool becomes more complicated. In this study, six recently released long-read assemblers (Canu, Flye, Hifiasm, Mecat2, Raven, and Wtdbg2) were evaluated on several quality and contiguity metrics after assembling four populations (or biotypes) of the same species (Russian wheat aphid, Diuraphis noxia) and two unrelated aphid species that have publicly available long-read sequences. Not all assemblers fared equally well across the different read sets, but, overall, the Hifiasm and Canu assemblers performed best. The best assemblies for each read set were also merged using quickmerge, which in some cases resulted in superior assemblies and in others introduced more errors. Ab initio gene calling between assemblies of the same read set also showed less similarity than expected. Overall, the quality control pipeline followed during the assembly resulted in chromosome-level assemblies with minimal structural or quality artefacts.
{"title":"Evaluating long-read assemblers to assemble several aphididae genomes.","authors":"Nicolaas F V Burger, Vittorio F Nicolis, Anna-Maria Botha","doi":"10.1093/bib/bbaf105","DOIUrl":"10.1093/bib/bbaf105","url":null,"abstract":"<p><p>Aphids are a speciose family of the Hemiptera comprising >5500 species. They have adapted to feed off multiple plant species and occur on every continent on Earth. Although aphids are economically devastating, very few aphid genomes have been sequenced and assembled, and those that have suffer low contiguity due to repeat-rich and AT-rich genomes. With third-generation sequencing becoming more affordable and approaching the quality of second-generation sequencing, the ability to produce more contiguous aphid genome assemblies is becoming a reality. With a growing list of long-read assemblers becoming available, the choice of which assembly tool to use becomes more complicated. In this study, six recently released long-read assemblers (Canu, Flye, Hifiasm, Mecat2, Raven, and Wtdbg2) were evaluated on several quality and contiguity metrics after assembling four populations (or biotypes) of the same species (Russian wheat aphid, Diuraphis noxia) and two unrelated aphid species that have publicly available long-read sequences. Not all assemblers fared equally well across the different read sets, but, overall, the Hifiasm and Canu assemblers performed the best. Merging of the best assemblies for each read set was also performed using quickmerge; in some cases this resulted in superior assemblies and, in others, introduced more errors. Ab initio gene calling between assemblies of the same read set also showed less similarity than expected. Overall, the quality control pipeline followed during the assembly resulted in chromosome-level assemblies with minimal structural or quality artefacts.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11904405/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143623664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
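The aphid assembler comparison above ranks assemblies on contiguity metrics but this listing does not name them. One widely used contiguity metric is N50; the Python sketch below shows how it is usually computed, on the assumption (not confirmed by the abstract) that a metric of this kind was among those reported. The contig lengths are toy values invented for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the assembly.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Two toy assemblies with the same total size (300) but different contiguity.
assembly_a = [100, 80, 50, 30, 20, 10, 10]
assembly_b = [40, 40, 40, 40, 40, 40, 40, 20]

print("N50 of assembly A:", n50(assembly_a))
print("N50 of assembly B:", n50(assembly_b))
```

Because both toy assemblies have the same total length, the difference in N50 reflects only how fragmented each one is, which is the property a contiguity metric is meant to capture.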
DNA methylation is an epigenetic marker that directly or indirectly regulates several critical cellular processes. While most cytosines in mammalian genomes maintain stable methylation patterns over time, cytosines that belong to specific regulatory regions, such as promoters and enhancers, can exhibit dynamic changes. These changes in methylation are driven by complex cellular machinery, in which the enzymes DNMT3 and TET play key roles. The objective of this study is to design a machine learning model capable of accurately predicting which cytosines have a fluctuating methylation level [hereafter called differentially methylated cytosines (DMCs)] from the surrounding DNA sequence. Here, we introduce L-MAP, a transformer-based large language model that is trained on DNMT3-knockout and TET-knockout data in human and mouse embryonic stem cells. Our extensive experimental results demonstrate the high accuracy of L-MAP in predicting DMCs. Our experiments also explore whether a classifier trained on human knockout data could predict DMCs in the mouse genome (and vice versa), and whether a classifier trained on DNMT3 knockout data could predict DMCs in TET knockouts (and vice versa). L-MAP enables the identification of sequence motifs associated with the enzymatic activity of DNMT3 and TET, which include known motifs as well as novel binding sites that could provide new insights into DNA methylation in stem cells. L-MAP is available at https://github.com/ucrbioinfo/dmc_prediction.
{"title":"Predicting differentially methylated cytosines in TET and DNMT3 knockout mutants via a large language model.","authors":"Saleh Sereshki, Stefano Lonardi","doi":"10.1093/bib/bbaf092","DOIUrl":"10.1093/bib/bbaf092","url":null,"abstract":"<p><p>DNA methylation is an epigenetic marker that directly or indirectly regulates several critical cellular processes. While most cytosines in mammalian genomes maintain stable methylation patterns over time, cytosines that belong to specific regulatory regions, such as promoters and enhancers, can exhibit dynamic changes. These changes in methylation are driven by complex cellular machinery, in which the enzymes DNMT3 and TET play key roles. The objective of this study is to design a machine learning model capable of accurately predicting which cytosines have a fluctuating methylation level [hereafter called differentially methylated cytosines (DMCs)] from the surrounding DNA sequence. Here, we introduce L-MAP, a transformer-based large language model that is trained on DNMT3-knockout and TET-knockout data in human and mouse embryonic stem cells. Our extensive experimental results demonstrate the high accuracy of L-MAP in predicting DMCs. Our experiments also explore whether a classifier trained on human knockout data could predict DMCs in the mouse genome (and vice versa), and whether a classifier trained on DNMT3 knockout data could predict DMCs in TET knockouts (and vice versa). L-MAP enables the identification of sequence motifs associated with the enzymatic activity of DNMT3 and TET, which include known motifs as well as novel binding sites that could provide new insights into DNA methylation in stem cells. L-MAP is available at https://github.com/ucrbioinfo/dmc_prediction.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11904404/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143623602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}