Pub Date: 2024-12-30; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae184
Orhan Sari, Ziying Liu, Youlian Pan, Xiaojian Shao
Motivation: The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-Cas9 system is a ground-breaking genome editing tool that has revolutionized cell and gene therapies. An essential component of its success is the design of an optimal single-guide RNA (sgRNA) with high on-target cleavage efficiency and low off-target effects. This is challenging because many conditions need to be considered, and empirically testing every design is time-consuming and costly. In silico prediction using machine learning models provides a high-performance alternative.
Results: We present CrisprBERT, a deep learning model that predicts the off-target effects of sgRNAs using only the sgRNAs and their paired DNA sequences. It combines a Bidirectional Encoder Representations from Transformers (BERT) architecture, which provides a high-dimensional embedding for the paired sgRNA and DNA sequences, with Bidirectional Long Short-Term Memory networks for learning. We proposed doublet stack encoding to capture the local energy configuration of Cas9 binding and applied the BERT model to learn the contextual embedding of the doublet pairs. Our results showed that the new model outperformed state-of-the-art deep learning models in single-split and leave-one-sgRNA-out cross-validation as well as in independent testing.
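The abstract does not spell out how doublet stack encoding is defined, so the following is only a minimal sketch of one plausible reading: each token combines the sgRNA bases and the DNA bases at two adjacent positions, and consecutive tokens overlap by one position. The vocabulary `DOUBLET_VOCAB` and the function `doublet_tokens` are hypothetical names introduced for illustration and are not taken from the CrisprBERT repository.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical vocabulary: every combination of two guide bases and two
# target bases gets one integer id, giving 4**4 = 256 possible tokens.
DOUBLET_VOCAB = {
    "".join(combo): idx
    for idx, combo in enumerate(product(BASES, repeat=4))
}


def doublet_tokens(sgrna: str, dna: str) -> list[int]:
    """Encode an aligned sgRNA/DNA pair as overlapping doublet token ids.

    Assumption: both sequences are already aligned, have equal length,
    and contain only A, C, G, T (U in the guide is mapped to T).
    """
    if len(sgrna) != len(dna):
        raise ValueError("sgRNA and DNA sequences must have equal length")
    sgrna = sgrna.upper().replace("U", "T")
    dna = dna.upper()
    tokens = []
    for i in range(len(sgrna) - 1):
        # Key layout: guide bases at i and i+1, then target bases at i and i+1.
        key = sgrna[i:i + 2] + dna[i:i + 2]
        tokens.append(DOUBLET_VOCAB[key])
    return tokens
```

A sequence pair of length n yields n - 1 tokens, which could then be fed to an embedding layer. Ambiguous bases such as N raise a `KeyError` in this sketch; a real encoder would need an explicit policy for them.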
Availability and implementation: CrisprBERT is available on GitHub: https://github.com/OSsari/CrisprBERT.
Title: Predicting CRISPR-Cas9 off-target effects in human primary cells using bidirectional LSTM with BERT embedding. Journal: Bioinformatics Advances 5(1), vbae184. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11696696/pdf/
Pub Date: 2024-12-24; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae207
Cyprien A Rivier, Santiago Clocchiatti-Tuozzo, Shufan Huo, Victor Torres-Lopez, Daniela Renedo, Kevin N Sheth, Guido J Falcone, Julian N Acosta
Motivation: The expansion of genetic association data from genome-wide association studies has increased the importance of methodologies like Polygenic Risk Scores (PRS) and Mendelian Randomization (MR) in genetic epidemiology. However, their application is often impeded by complex, multi-step workflows requiring specialized expertise and the use of disparate tools with varying data formatting requirements. Existing solutions are frequently standalone packages or command-line based, largely due to dependencies on tools like PLINK, which limits accessibility for researchers without computational experience. Given Python's popularity and ease of use, there is a need for an integrated, user-friendly Python toolkit to streamline PRS and MR analyses.
Results: We introduce Genal, a Python package that consolidates SNP-level data handling, cleaning, clumping, PRS computation, and MR analyses into a single, cohesive toolkit. By wrapping PLINK, Genal eliminates the need for multiple R packages and for command-line interaction, lowering the barrier for medical scientists to perform complex genetic epidemiology studies. Genal draws on concepts from several well-established tools, ensuring that users have access to rigorous statistical techniques in the intuitive Python environment. Additionally, Genal leverages parallel processing for MR methods, including MR-PRESSO, significantly reducing the computational time required for these analyses.
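For readers unfamiliar with the PRS computation step mentioned above, the standard additive score is a weighted sum of effect-allele dosages, PRS_i = sum over SNPs j of beta_j * dosage_ij. The sketch below illustrates that formula in plain Python; it does not use or imitate Genal's API, and the function name `polygenic_risk_scores`, the sample labels, and the SNP identifiers are placeholders chosen for illustration.

```python
def polygenic_risk_scores(dosages, weights):
    """Compute an additive polygenic risk score per sample.

    dosages: dict mapping sample id -> dict of SNP id -> effect-allele
             dosage (a number between 0 and 2).
    weights: dict mapping SNP id -> effect size (for example a GWAS beta).

    Assumption: SNPs missing from a sample's genotypes are skipped rather
    than imputed, which is only one of several possible conventions.
    """
    scores = {}
    for sample, genotype in dosages.items():
        total = 0.0
        for snp, beta in weights.items():
            if snp in genotype:
                total += beta * genotype[snp]
        scores[sample] = total
    return scores


# Toy input with made-up identifiers and values.
example_dosages = {
    "sample_1": {"snp_a": 2, "snp_b": 1},
    "sample_2": {"snp_a": 0, "snp_b": 2},
}
example_weights = {"snp_a": 0.5, "snp_b": -0.2}
example_scores = polygenic_risk_scores(example_dosages, example_weights)
```

In practice the weights would come from clumped GWAS summary statistics, and allele orientation must be harmonized between the weights and the genotype data before summing.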
Availability and implementation: The package is available on PyPI (https://pypi.org/project/genal-python/), the code is openly available on GitHub with a tutorial (https://github.com/CypRiv/genal), and the documentation can be found on Read the Docs (https://genal.rtfd.io).
Title: Genal: a Python toolkit for genetic risk scoring and Mendelian randomization. Journal: Bioinformatics Advances 5(1), vbae207. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11706532/pdf/
Pub Date: 2024-12-24; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae208
Hoang M Ngo, Tamim Khatib, My T Thai, Tamer Kahveci
Motivation: The network motif identification (MI) problem aims to find topological patterns in biological networks. Identifying disjoint motifs is computationally challenging on classical computers. Quantum computers make it possible to solve high-complexity problems that do not scale on classical computers. In this article, we develop the first quantum solution to the MI problem, called QOMIC (Quantum Optimization for Motif IdentifiCation). QOMIC transforms the MI problem using an integer model, which serves as the foundation for our quantum solution. We develop and implement a quantum circuit that uses this model to find motif locations in the given network.
Results: Our experiments demonstrate that QOMIC outperforms the existing solutions developed for classical computers in terms of motif counts. We also observe that QOMIC can efficiently find motifs in human regulatory networks associated with five neurodegenerative diseases: Alzheimer's, Parkinson's, Huntington's, Amyotrophic Lateral Sclerosis, and Motor Neurone Disease.
Availability and implementation: Our implementation can be found at https://github.com/ngominhhoang/Quantum-Motif-Identification.git.
Title: QOMIC: quantum optimization for motif identification. Journal: Bioinformatics Advances 5(1), vbae208. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11725347/pdf/
Pub Date: 2024-12-14; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae200
Unmani Jaygude, Graham M Hughes, Jeremy C Simpson
Motivation: Rab GTPases (Rabs) are crucial for membrane trafficking within mammalian cells, and their dysfunction is implicated in many diseases. This gene family plays a role in several crucial cellular processes. Network analyses can uncover the complete repertoire of interaction patterns across the Rab network, informing disease research and opening new opportunities for therapeutic interventions.
Results: We examined Rabs and their interactors in the context of epithelial-to-mesenchymal transition (EMT), an indicator of cancer metastasizing to distant organs. A Rab network was first established from analysis of literature and was gradually expanded. Our Python module, resnet, assessed its network resilience and selected an optimally sized, resilient Rab network for further analyses. Pathway enrichment confirmed its role in EMT. We then identified 73 candidate genes showing a strong up-/down-regulation, across 10 cancer types, in patients with metastasized tumours compared to only primary-site tumours. We suggest that their encoded proteins might play a critical role in EMT, and further in vitro studies are needed to confirm their role as predictive markers of cancer metastasis. The use of resnet within the systematic analysis approach described here can be easily applied to assess other gene families and their role in biological events of interest.
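The abstract does not state which resilience measure resnet computes. As a hedged illustration of what a network resilience assessment can look like, the sketch below removes nodes in order of decreasing degree and records the fraction of the original nodes that remain in the largest connected component. This targeted-attack curve is a common generic measure and is an assumption here, not a description of resnet; the function names and the toy graph are hypothetical.

```python
from collections import deque


def largest_component_size(adjacency, removed):
    """Size of the largest connected component, ignoring removed nodes."""
    seen = set(removed)
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def resilience_curve(adjacency):
    """Remove nodes from highest to lowest initial degree.

    Returns, after each removal, the fraction of the original nodes that
    are still in the largest connected component. The adjacency dict must
    be undirected and list every node as a key.
    """
    order = sorted(adjacency, key=lambda n: len(adjacency[n]), reverse=True)
    total = len(adjacency)
    removed = set()
    curve = []
    for node in order:
        removed.add(node)
        curve.append(largest_component_size(adjacency, removed) / total)
    return curve


# Toy star-shaped network: one hub with three interactors.
toy_network = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}
toy_curve = resilience_curve(toy_network)
```

A curve that drops sharply after the first few removals indicates a network that depends on a small number of hubs, while a slowly declining curve indicates a more resilient topology.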
Availability and implementation: Source code for resnet is freely available at https://github.com/Unmani199/resnet.
Title: Exploring the role of the Rab network in epithelial-to-mesenchymal transition. Journal: Bioinformatics Advances 5(1), vbae200. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684074/pdf/
Pub Date: 2024-12-14; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae201
Aurora Maurizio, Anna Sofia Tascini, Marco J Morelli
Motivation: Proteins at the cell surface connect signaling networks and largely determine a cell's capacity to communicate and interact with its environment. In particular, variations in transcriptomic profiles are often observed between healthy and diseased cells, leading to distinct sets of cell-surface proteins. For these reasons, cell-surface proteins may act as biomarkers for the detection of cells of interest in tissues or body fluids, are often the target of pharmaceutical agents, and hold significant promise in clinical practice for diagnosis, prognosis, treatment development, and evaluation of therapy response. Therefore, implementing robust methods to identify condition-specific cell-surface proteins is of pivotal importance for advancing biomedical research.
Results: We developed SurfR, an R/Bioconductor package providing a streamlined end-to-end workflow for computationally identifying surface protein-coding genes from expression data. Our user-friendly, comprehensive workflow performs systematic expression data retrieval from public databases, differential gene expression across conditions, integration of datasets, enrichment analysis, identification of targetable proteins on a condition of interest, and data visualization.
Availability and implementation: SurfR is released under the GNU GPL v3.0 license. Source code, documentation, examples, and tutorials are available through Bioconductor (http://www.bioconductor.org/packages/SurfR). R Markdown notebooks containing the use-case code described in the manuscript can be found on GitHub (https://github.com/auroramaurizio/SurfR_UseCases).
Title: SurfR: Riding the wave of RNA-seq data with a comprehensive bioconductor package to identify surface protein-coding genes. Journal: Bioinformatics Advances 5(1), vbae201. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671034/pdf/
Pub Date: 2024-12-13; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae203
Mahsa Monshizadeh, Yuhui Hong, Yuzhen Ye
Motivation: Microbial signatures in the human microbiome are closely associated with various human diseases, driving the development of machine learning models for microbiome-based disease prediction. Despite progress, challenges remain in enhancing prediction accuracy, generalizability, and interpretability. Confounding factors, such as the host's gender, age, and body mass index, significantly influence the human microbiome, complicating microbiome-based predictions.
Results: To address these challenges, we developed MicroKPNN-MT, a unified model for predicting human phenotype based on microbiome data, as well as additional metadata like age and gender. This model builds on our earlier MicroKPNN framework, which incorporates prior knowledge of microbial species into neural networks to enhance prediction accuracy and interpretability. In MicroKPNN-MT, metadata, when available, serves as additional input features for prediction. Otherwise, the model predicts metadata from microbiome data using additional decoders. We applied MicroKPNN-MT to microbiome data collected in mBodyMap, covering healthy individuals and 25 different diseases, and demonstrated its potential as a predictive tool for multiple diseases, which at the same time provided predictions for the missing metadata. Our results showed that incorporating real or predicted metadata helped improve the accuracy of disease predictions, and more importantly, helped improve the generalizability of the predictive models.
Availability and implementation: https://github.com/mgtools/MicroKPNN-MT.
Title: Multitask knowledge-primed neural network for predicting missing metadata and host phenotype based on human microbiome. Journal: Bioinformatics Advances 5(1), vbae203. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11676323/pdf/
Pub Date: 2024-12-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae196
Jenniffer Roa Lozano, Mataya Duncan, Duane D McKenna, Todd A Castoe, Michael DeGiorgio, Richard Adams
Motivation: The scale and scope of comparative trait data are expanding at unprecedented rates, and recent advances in evolutionary modeling and simulation sometimes struggle to match this pace. Well-organized and flexible applications for conducting large-scale simulations of evolution hold promise in this context for understanding both the models and, more importantly, our ability to confidently estimate them with real trait data sampled from nature.
Results: We introduce TraitTrainR, an R package designed to facilitate efficient, large-scale simulations under complex models of continuous trait evolution. TraitTrainR employs several output formats, supports popular trait data transformations, accommodates multi-trait evolution, and exhibits flexibility in defining input parameter space and model stacking. Moreover, TraitTrainR permits measurement error, allowing for investigation of its potential impacts on evolutionary inference. We envision a wealth of applications of TraitTrainR, and we demonstrate one such example by examining the problem of evolutionary model selection in three empirical phylogenetic case studies. Collectively, these demonstrations of applying TraitTrainR to explore problems in model selection underscore its utility and broader promise for addressing key questions in comparative biology, including those related to experimental design and statistical power.
Availability and implementation: TraitTrainR is developed in R 4.4.0 and is freely available at https://github.com/radamsRHA/TraitTrainR/, which includes detailed documentation, quick-start guides, and a step-by-step tutorial.
Title: TraitTrainR: accelerating large-scale simulation under models of continuous trait evolution. Journal: Bioinformatics Advances 5(1), vbae196. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11696700/pdf/
Pub Date: 2024-12-06; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae195
Mikel Martinez-Goikoetxea
Motivation: Coiled coils are a widespread structural motif consisting of multiple α-helices that wind around a central axis to bury their hydrophobic core. While AlphaFold has emerged as an effective coiled-coil modeling tool, capable of accurately predicting changes in periodicity and core geometry along coiled-coil stalks, it is not without limitations, such as the generation of spuriously bent models and the inability to effectively model globally non-canonical coiled coils. To overcome these limitations, we investigated whether dividing full-length sequences into fragments would result in better models.
Results: We developed CCfrag to leverage AlphaFold for the piece-wise modeling of coiled coils. The user can create a specification, defined by window size, length of overlap, and oligomerization state, and the program produces the files necessary to run AlphaFold predictions. The structural models and their scores are then integrated into a rich per-residue representation defined by sequence- or structure-based features. Our results suggest that removing coiled-coil sequences from their native context can improve prediction confidence and result in better models. In this article, we present various use cases of CCfrag and propose that fragment-based prediction is useful for understanding the properties of long, fibrous coiled coils by revealing local features not seen in full-length models.
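The specification described above (window size, overlap length, oligomerization state) can be pictured with a small sliding-window sketch. This is not CCfrag's code or interface: the functions `fragment_sequence` and `to_fasta`, the record naming scheme, and the choice to emit one FASTA record per chain are all assumptions made for illustration, and the abstract does not say what input files CCfrag actually writes.

```python
def fragment_sequence(sequence, window, overlap):
    """Split a sequence into overlapping windows.

    Returns a list of (zero-based start, fragment) tuples. Consecutive
    windows share `overlap` residues; the final fragment may be shorter
    than `window` if the sequence length does not divide evenly.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    fragments = []
    start = 0
    while True:
        end = min(start + window, len(sequence))
        fragments.append((start, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return fragments


def to_fasta(name, fragments, n_chains):
    """Write each fragment n_chains times to mimic a homo-oligomer input.

    Hypothetical layout: one record per chain, labelled with one-based
    residue ranges of the fragment in the full-length sequence.
    """
    records = []
    for start, fragment in fragments:
        first, last = start + 1, start + len(fragment)
        for chain in range(1, n_chains + 1):
            records.append(f">{name}_{first}-{last}_chain{chain}\n{fragment}")
    return "\n".join(records)


# Toy sequence: letters stand in for residues.
toy_fragments = fragment_sequence("ABCDEFGHIJ", window=4, overlap=2)
toy_fasta = to_fasta("cc", toy_fragments[:1], n_chains=2)
```

Because every residue away from the termini is covered by more than one window when the overlap is non-zero, per-residue scores from overlapping fragments can later be combined, which is in the spirit of the per-residue representation the abstract describes.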
Availability and implementation: The program is implemented as a Python module. The code and its documentation are available at https://github.com/Mikel-MG/CCfrag.
Title: CCfrag: scanning folding potential of coiled-coil fragments with AlphaFold. Journal: Bioinformatics Advances 5(1), vbae195. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11676326/pdf/
Pub Date: 2024-12-06; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbae197
Khondamir R Rustamov, Artyom Y Baev
Motivation: Understanding the conformational landscape of protein-ligand interactions is critical for elucidating the binding mechanisms that govern these interactions. Traditional methods like molecular dynamics (MD) simulations are computationally intensive, leading to a demand for more efficient approaches. This study explores how multiple sequence alignment (MSA) clustering enhances AF-Multimer's ability to predict conformational landscapes, particularly for proteins with multiple conformational states.
Results: We verified this approach by predicting the conformational landscapes of chemokine receptor 4 (CXCR4) and glucagon receptor (GCGR) in the presence of their agonists and antagonists. In our experiments, AF-Multimer predicted the structures of CXCR4 and GCGR predominantly in the active state in the presence of agonists and in the inactive state in the presence of antagonists. Moreover, we tested our approach with proteins known to switch between monomeric and dimeric states, such as lymphotactin, SH3, and thermonuclease. AFcluster-Multimer accurately predicted conformational states during oligomerization, which AFcluster with AlphaFold2 alone failed to achieve. In conclusion, MSA clustering enhances AF-Multimer's ability to predict protein conformational landscapes and the mechanistic effects of ligand binding, offering a robust tool for understanding protein-ligand interactions.
Availability and implementation: Code for running AFcluster-Multimer is available at https://github.com/KhondamirRustamov/AF-Multimer-cluster.
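The idea of clustering an MSA so that each cluster can serve as a separate, shallower alignment for structure prediction can be sketched as follows. The abstract does not state which clustering algorithm or distance is used, so this example substitutes a simple single-linkage threshold on the fraction of mismatched alignment columns; `hamming_fraction`, `cluster_msa`, `build_sub_msas`, and the `max_dist` parameter are all hypothetical names introduced for illustration, not part of AFcluster-Multimer.

```python
def hamming_fraction(a: str, b: str) -> float:
    """Fraction of alignment columns at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_msa(aligned: list[str], max_dist: float) -> list[list[int]]:
    """Single-linkage clustering of MSA rows (an assumed stand-in algorithm).

    Two rows are linked when their mismatch fraction is <= max_dist;
    clusters are the connected components of those links.
    Returns lists of row indices, ordered by their smallest index.
    """
    parent = list(range(len(aligned)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(aligned)):
        for j in range(i + 1, len(aligned)):
            if hamming_fraction(aligned[i], aligned[j]) <= max_dist:
                parent[find(j)] = find(i)

    clusters: dict[int, list[int]] = {}
    for i in range(len(aligned)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def build_sub_msas(query: str, aligned: list[str],
                   clusters: list[list[int]]) -> list[list[str]]:
    """Build one sub-alignment per cluster, with the query as the first row."""
    return [[query] + [aligned[i] for i in members] for members in clusters]
```

Each sub-alignment returned by `build_sub_msas` would then be supplied to the predictor in place of the full MSA, so that different clusters can steer the prediction towards different conformational states. The quadratic pairwise comparison is only suitable for small alignments.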
Title: MSA clustering enhances AF-Multimer's ability to predict conformational landscapes of protein-protein interactions. Authors: Khondamir R Rustamov, Artyom Y Baev. Journal: Bioinformatics Advances 5(1), vbae197. Published: 2024-12-06. DOI: 10.1093/bioadv/vbae197. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671036/pdf/
Pub Date: 2024-12-05; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae194
Ting Gao, Xue Zhai, Chuan Yang, Linlin Lv, Han Wang
Motivation: Joint extraction of entities and relations is an important research direction in Information Extraction. The volume of scientific and technological biomedical literature is increasing rapidly, so automatically extracting entities and their relations from this literature is a key task for advancing biomedical research.
Results: The joint entity and relation extraction model performs both intra-sentence and cross-sentence extraction, alleviating the problem of long-distance information dependence in long documents. The model incorporates several deep learning techniques: (i) a fine-tuned BERT pre-trained model for text classification, (ii) a Graph Convolutional Network learning method, (iii) Robust Learning Against Textual Label Noise with Self-Mixup Training, and (iv) Local regularization Conditional Random Fields. Together, these allow the model to identify entities in complex biomedical literature effectively, extract triples within and across sentences, reduce the effect of noisy data during training, and improve robustness and accuracy. The experimental results show that the model performs well on the self-built BM_GBD dataset and on public datasets, enabling precise, large language model-enhanced knowledge graph construction for biomedical tasks.
Availability and implementation: The model and partial code are available on GitHub at https://github.com/zhaix922/Joint-extraction-of-entity-and-relation.
Title: Joint extraction of entity and relation based on fine-tuning BERT for long biomedical literatures. Authors: Ting Gao, Xue Zhai, Chuan Yang, Linlin Lv, Han Wang. Journal: Bioinformatics Advances 4(1), vbae194. Published: 2024-12-05. DOI: 10.1093/bioadv/vbae194. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11665630/pdf/