Motivation: The human leukocyte antigen (HLA) system is the main cause of organ transplant loss: HLAs present on the graft are recognized by donor-specific antibodies raised by the recipient. It is therefore of key importance to identify all potentially immunogenic B-cell epitopes on HLAs in order to refine organ allocation. Such HLA epitopes are currently characterized by the presence of polymorphic residues called "eplets". However, many polymorphic positions in HLA sequences have not yet been experimentally confirmed as eplets associated with an HLA epitope. Moreover, structural studies of these epitopes have so far considered only static 3D structures.
Results: We present here a machine-learning approach for predicting HLA epitopes, based on 3D-surface patches and molecular dynamics simulations. A collection of 3D-surface patches labeled as Epitope (2117) or Nonepitope (4769) according to Human Leukocyte Antigen Eplet Registry information was derived from 207 HLAs (61 solved and 146 predicted structures). Descriptors derived from static and dynamic patch properties were computed, and three tree-based models were trained on a reduced non-redundant dataset. HLA-EpiCheck is the prediction system formed by the three models; dynamic descriptors of the 3D-surface patches account for more than half of its prediction performance. Epitope predictions on unconfirmed eplets (absent from the initial dataset) are compared with experimental results, and notable consistency is found.
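As a rough illustration of the modeling setup (not the authors' code; descriptor values, dataset sizes, and labels below are synthetic), training a tree-based classifier on fixed-length 3D-surface patch descriptor vectors with an Epitope/Nonepitope label might look like:

```python
# Illustrative sketch: tree-based classification of labeled patch descriptors.
# All data here are synthetic stand-ins for the static/dynamic patch descriptors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
n_patches, n_descriptors = 600, 20            # hypothetical sizes
X = rng.normal(size=(n_patches, n_descriptors))
# Fabricated label rule so the toy data carry learnable signal:
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n_patches)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(balanced_accuracy_score(y_te, clf.predict(X_te)))
```

Balanced accuracy is a reasonable choice here because, as in the real dataset, the two classes are imbalanced.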
Availability and implementation: Structural data and MD trajectories are deposited as open data under doi: 10.57745/GXZHH8. In-house scripts and machine-learning models for HLA-EpiCheck are available from https://gitlab.inria.fr/capsid.public_codes/hla-epicheck.
"HLA-EpiCheck: novel approach for HLA B-cell epitope prediction using 3D-surface patch descriptors derived from molecular dynamic simulations." Diego Amaya-Ramirez, Magali Devriese, Romain Lhotte, Cédric Usureau, Malika Smaïl-Tabbone, Jean-Luc Taupin, Marie-Dominique Devignes. Bioinformatics Advances 4(1):vbae186. Pub Date: 2024-12-05. DOI: 10.1093/bioadv/vbae186.
Pub Date: 2024-12-04 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae170
Perry T Wasdin, Alexandra A Abu-Shmais, Michael W Irvin, Matthew J Vukovich, Ivelin S Georgiev
Motivation: LIBRA-seq (linking B cell receptor to antigen specificity by sequencing) provides a powerful tool for interrogating the antigen-specific B cell compartment and identifying antibodies against antigen targets of interest. Identification of noise in single-cell B cell receptor sequencing data, such as LIBRA-seq, is critical for improving antigen binding predictions for downstream applications including antibody discovery and machine learning technologies.
Results: In this study, we present a method for denoising LIBRA-seq data by clustering antigen counts into signal and noise components with a negative binomial mixture model. This approach leverages single-cell sequencing reads from a large, multi-donor dataset described in a recent LIBRA-seq study to develop a data-driven means for identification of technical noise. We apply this method to nine donors representing separate LIBRA-seq experiments and show that our approach provides improved predictions for in vitro antibody-antigen binding compared with the standard scoring method, despite variance in data size and noise structure across samples. This development will improve the ability of LIBRA-seq to identify antigen-specific B cells and contribute to providing more reliable datasets for machine learning-based approaches as the corpus of single-cell B cell sequencing data continues to grow.
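The core idea can be sketched in a simplified form (not the authors' implementation): each antigen count is scored by its posterior probability of coming from a "signal" negative binomial component rather than a "noise" component. The mixture weights and component parameters below are fixed for illustration; in practice they would be fitted per sample.

```python
# Simplified sketch of signal/noise assignment under a two-component
# negative binomial mixture. All parameters are assumed, not fitted.
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(1)
noise = nbinom.rvs(n=2, p=0.6, size=900, random_state=rng)    # low counts
signal = nbinom.rvs(n=10, p=0.1, size=100, random_state=rng)  # high counts
counts = np.concatenate([noise, signal])

w_noise, w_signal = 0.9, 0.1                   # assumed mixture weights
lik_noise = nbinom.pmf(counts, n=2, p=0.6)
lik_signal = nbinom.pmf(counts, n=10, p=0.1)
post_signal = (w_signal * lik_signal) / (w_noise * lik_noise
                                         + w_signal * lik_signal)
is_signal = post_signal > 0.5                  # posterior assignment
print(is_signal.sum())
```

With well-separated components, thresholding the posterior at 0.5 recovers essentially the high-count cells as signal.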
Availability and implementation: All data and code are available at https://github.com/IGlab-VUMC/mixture_model_denoising.
"Negative binomial mixture model for identification of noise in antibody-antigen specificity predictions from single-cell data." Bioinformatics Advances 4(1):vbae170.
Pub Date: 2024-12-02 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae193
Arjun Srivatsa, Russell Schwartz
Motivation: Genomic biotechnology has rapidly advanced, allowing for the inference and modification of genetic and epigenetic information at the single-cell level. While these tools hold enormous potential for basic and clinical research, they also raise difficult issues of how to design studies to deploy them most effectively. In designing a genomic study, a modern researcher might combine many sequencing modalities and sampling protocols, each with different utility, costs, and other tradeoffs. This is especially relevant for studies of somatic variation, which may involve highly heterogeneous cell populations whose differences can be probed via an extensive set of biotechnological tools. Efficiently deploying genomic technologies in this space will require principled ways to create study designs that recover desired genomic information while minimizing various measures of cost.
Results: The central problem this paper addresses is how to create an optimal study design for a genomic analysis, with particular focus on studies of somatic variation, which arise most often in cancer genomics. We pose the study design problem as a stochastic constrained nonlinear optimization problem. We introduce a Bayesian optimization framework that iteratively optimizes an objective function using surrogate modeling combined with pattern and gradient search. We demonstrate our procedure on several test cases, deriving resource and study-design allocations optimized for various goals and criteria and showing that it can optimize study designs efficiently across diverse scenarios.
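The generic loop such frameworks build on (surrogate model plus acquisition rule) can be sketched as follows. This is a minimal illustration with a toy 1D objective, a Gaussian-process surrogate, and an upper-confidence-bound acquisition, not the paper's framework, which additionally handles constraints and uses pattern and gradient search.

```python
# Minimal Bayesian-optimization sketch: fit a GP surrogate to past
# evaluations, then evaluate the point maximizing an acquisition function.
import numpy as np

def objective(x):                       # toy stand-in for a study-design score
    return -(x - 0.3) ** 2

def gp_posterior(X, y, Xs, length=0.2, noise=1e-4):
    """GP posterior mean/variance on grid Xs with an RBF kernel."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K_inv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
    Ks = k(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)
    return mu, np.maximum(var, 0.0)

grid = np.linspace(0.0, 1.0, 201)
X = np.array([0.0, 1.0])                # initial design points
y = objective(X)
for _ in range(15):
    mu, var = gp_posterior(X, y, grid)
    ucb = mu + 2.0 * np.sqrt(var)       # acquisition: exploit + explore
    x_next = grid[np.argmax(ucb)]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))
best = float(X[np.argmax(y)])
print(round(best, 2))
```

The loop alternates between sampling where the surrogate is uncertain and refining near the current optimum, so the best sampled point converges toward the true maximizer.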
Availability and implementation: https://github.com/CMUSchwartzLab/StudyDesignOptimization.
"Optimizing design of genomics studies for clonal evolution analysis." Bioinformatics Advances 4(1):vbae193.
Pub Date: 2024-11-29 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae190
My-Diem Nguyen Pham, Chinh Tran-To Su, Thanh-Nhan Nguyen, Hoai-Nghia Nguyen, Dinh Duy An Nguyen, Hoa Giang, Dinh-Thuc Nguyen, Minh-Duy Phan, Vy Nguyen
Motivation: Predicting T-cell receptor (TCR)-antigen binding is crucial for advancements in immunotherapy. However, most current TCR-peptide interaction predictors struggle to perform well on unseen data. This limitation may stem from the conventional use of TCR and/or peptide sequences as input, which may not adequately capture their structural characteristics. Therefore, incorporating the structural information of TCRs and peptides into the prediction model is necessary to improve its generalizability.
Results: We developed epiTCR-KDA (KDA stands for Knowledge Distillation model on Dihedral Angles), a new predictor of TCR-peptide binding that utilizes the dihedral angles between the residues of the peptide and the TCR as a structural descriptor. This structural information was integrated into a knowledge distillation model to enhance its generalizability. epiTCR-KDA demonstrated competitive prediction performance, with an area under the curve (AUC) of 1.00 for seen data and an AUC of 0.91 for unseen data. On public datasets, epiTCR-KDA consistently outperformed other predictors, maintaining a median AUC of 0.93. Further analysis of epiTCR-KDA revealed that the cosine similarity of the dihedral angle vectors between the unseen testing data and training data is crucial for its stable performance. In conclusion, our epiTCR-KDA model represents a significant step forward in developing a highly effective pipeline for antigen-based immunotherapy.
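The two geometric ingredients mentioned above can be sketched independently of the epiTCR-KDA pipeline: computing a dihedral angle from four atom positions, and the cosine similarity between two dihedral-angle vectors. The coordinates below are fabricated for illustration.

```python
# Standard dihedral-angle computation (four points -> signed angle) and
# cosine similarity between descriptor vectors. Toy coordinates only.
import numpy as np

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b0, b1, b2 = p0 - p1, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1         # components orthogonal to the bond
    w = b2 - np.dot(b2, b1) * b1
    return np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A planar zig-zag chain is a trans conformation (dihedral of +/-180 degrees):
pts = [np.array(p, dtype=float)
       for p in [(0, 0, 0), (1, 1, 0), (2, 0, 0), (3, 1, 0)]]
angle = dihedral_deg(*pts)
print(round(abs(float(angle)), 1))
```

A per-structure descriptor vector would then simply stack such angles over residue quadruples, and the similarity between two structures is the cosine of the angle between their vectors.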
Availability and implementation: epiTCR-KDA is available on GitHub (https://github.com/ddiem-ri-4D/epiTCR-KDA).
"epiTCR-KDA: knowledge distillation model on dihedral angles for TCR-peptide prediction." Bioinformatics Advances 4(1):vbae190.
Pub Date: 2024-11-28 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae191
Matko Glunčić, Domjan Barić, Vladimir Paar
Motivation: Tandem monomeric units, integral components of eukaryotic genomes, form higher-order repeat (HOR) structures that play crucial roles in maintaining chromosome integrity and regulating gene expression and protein abundance. Given their significant influence on processes such as evolution, chromosome segregation, and disease, developing a sensitive and automated tool for identifying HORs across diverse genomic sequences is essential.
Results: In this study, we applied the GRMhor (Global Repeat Map hor) algorithm to analyse the centromeric region of chromosome 20 in three individual human genomes, as well as in the centromeric regions of three higher primates. In all three human genomes, we identified six distinct HOR arrays, which revealed significantly greater differences in the number of canonical and variant copies, as well as in their overall structure, than would be expected given the 99.9% genetic similarity among humans. Furthermore, our analysis of higher primate genomes, which revealed entirely different HOR sequences, indicates a much larger genomic divergence between humans and higher primates than previously recognized. These results underscore the suitability of the GRMhor algorithm for studying specificities in individual genomes, particularly those involving repetitive monomers in centromere structure, which is essential for proper chromosome segregation during cell division, while also highlighting its utility in exploring centromere evolution and other repetitive genomic regions.
Availability and implementation: Source code and example binaries freely available for download at github.com/gluncic/GRM2023.
"Efficient genome monomer higher-order structure annotation and identification using the GRMhor algorithm." Bioinformatics Advances 4(1):vbae191.
Pub Date: 2024-11-27 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbae192
Mohammed Zniber, Youssef Fatihi, Tan-Phat Huynh
Motivation: NMR-based metabolomics is a field driven by technological advancements, necessitating the use of advanced preprocessing tools. Despite this need, there is a remarkable scarcity of comprehensive and user-friendly preprocessing tools in Python. To bridge this gap, we have developed Protomix, a Python package designed for metabolomics research. Protomix offers a set of automated, efficient, and user-friendly signal-preprocessing steps, tailored to streamline and enhance the preprocessing phase in metabolomics studies.
Results: This package presents a comprehensive preprocessing pipeline compatible with various data analysis tools. It encompasses a suite of functionalities for data extraction, preprocessing, and interactive visualization. Additionally, it includes a tutorial in the form of a Python Jupyter notebook, specifically designed for the analysis of 1D 1H-NMR metabolomics data related to prostate cancer and benign prostatic hyperplasia.
Availability and implementation: Protomix can be accessed at https://github.com/mzniber/protomix and https://protomix.readthedocs.io/en/latest/index.html.
"Protomix: a Python package for ¹H-NMR metabolomics data preprocessing." Bioinformatics Advances 5(1):vbae192.
Motivation: Gene trees often differ from the species trees that contain them due to various factors, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Several highly accurate species tree estimation methods have been introduced to explicitly address ILS, including ASTRAL, a widely used statistically consistent method, and wQFM, a quartet amalgamation approach experimentally shown to be more accurate than ASTRAL. Two recent advancements, ASTRAL-Pro and DISCO, have emerged in phylogenomics to consider GDL. ASTRAL-Pro introduces a refined quartet similarity measure, accounting for orthology and paralogy. On the other hand, DISCO offers a general strategy to decompose multi-copy gene trees into a collection of single-copy trees, allowing the utilization of methods previously designed for species tree inference in the context of single-copy gene trees.
Results: In this study, we first introduce some variants of DISCO to examine its underlying hypotheses and present analytical results on the statistical guarantees of DISCO. In particular, we introduce DISCO-R, a variant of DISCO with a refined and improved pruning strategy that provides more accurate and robust results. We then demonstrate with extensive evaluation studies on a collection of simulated and real data sets that wQFM paired with DISCO variants consistently matches or outperforms ASTRAL-Pro and other competing methods.
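The decomposition idea can be illustrated with a toy sketch that is greatly simplified relative to DISCO (DISCO roots the gene tree and applies a careful pruning strategy, which DISCO-R further refines; here a tuple-encoded binary tree is simply split wherever a node's two subtrees share a species):

```python
# Toy multi-copy gene-tree decomposition: split at putative duplication
# nodes, i.e. nodes whose two subtrees contain overlapping species sets.
def leaves(t):
    """Set of species labels in a nested-tuple tree."""
    return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def decompose(t):
    """Return a list of single-copy subtrees of tuple-encoded tree t."""
    if isinstance(t, str):
        return [t]
    L, R = decompose(t[0]), decompose(t[1])
    if len(L) == 1 and len(R) == 1 and not (leaves(L[0]) & leaves(R[0])):
        return [(L[0], R[0])]            # speciation node: keep joined
    return L + R                         # duplication at/below: split

# ((A,B),(A,C)) contains species A twice, so it splits into two trees:
parts = decompose(((("A", "B"), ("A", "C")), "D"))
print(parts)
```

Each output tree contains at most one copy per species, so methods designed for single-copy gene trees (such as wQFM) can be run on the resulting collection.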
Availability and implementation: DISCO-R and other variants are freely available at https://github.com/skhakim/DISCO-variants.
"wQFM-DISCO: DISCO-enabled wQFM improves phylogenomic analyses despite the presence of paralogs." Sheikh Azizul Hakim, Md Rownok Zahan Ratul, Md Shamsuzzoha Bayzid. Bioinformatics Advances 4(1):vbae189. Pub Date: 2024-11-27. DOI: 10.1093/bioadv/vbae189.
Pub Date: 2024-11-25 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae188
Francesco Costa, Matthias Blum, Alex Bateman
Motivation: High confidence structure prediction models have become available for nearly all protein sequences. More than 200 million AlphaFold2 models are now publicly available. We observe that there can be significant variability in the prediction confidence as judged by plDDT scores across a protein family. We have explored whether the predictions with lower plDDT in a family can be improved by the use of higher plDDT templates from the family as template structures in AlphaFold2.
Results: Our work shows that structures with a low plDDT can be "rescued" (moved from low to reasonable confidence) about one-third of the time. We also find, surprisingly, that in many cases we get a higher-plDDT model when we switch off the multiple sequence alignment (MSA) option in AlphaFold2 and rely solely on a high-quality template. However, we find the best overall strategy is to make predictions both with and without the MSA information and select the model with the highest average plDDT. We also find that using high-plDDT models as templates can increase the speed of AlphaFold2 as implemented in ColabFold. Additionally, we provide evidence that as well as having increased overall plDDT, the models are likely to have higher-quality structures as judged by two metrics.
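The selection rule (keep the model with the highest average plDDT) can be sketched as below, using the usual AlphaFold2 convention of storing per-residue plDDT in the PDB B-factor column. The model names and ATOM records here are fabricated for illustration.

```python
# Pick the AlphaFold2 model with the highest mean plDDT, read from the
# B-factor column (columns 61-66) of PDB ATOM records. Toy records only.
def mean_plddt(pdb_lines):
    scores = [float(line[60:66]) for line in pdb_lines if line.startswith("ATOM")]
    return sum(scores) / len(scores)

def pick_best(models):
    """models: dict of name -> list of PDB lines; returns the best name."""
    return max(models, key=lambda name: mean_plddt(models[name]))

rec = ("ATOM      1  CA  ALA A   1      11.104   6.134  -6.504"
       "  1.00{:6.2f}           C")
models = {
    "with_msa": [rec.format(62.50)],   # hypothetical MSA-based prediction
    "no_msa": [rec.format(88.25)],     # hypothetical template-only prediction
}
print(pick_best(models))
```

In a real pipeline the two inputs would be the full PDB files produced by the with-MSA and without-MSA runs.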
Availability and implementation: We have implemented our pipeline in NextFlow and it is available in GitHub: https://github.com/FranceCosta/AF2Fix.
"Keeping it in the family: using protein family templates to rescue low confidence AlphaFold2 models." Bioinformatics Advances 4(1):vbae188.
Pub Date: 2024-11-23 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae180
Gutama Ibrahim Mohammad, Tom Michoel
Motivation: Gene expression prediction plays a vital role in transcriptome-wide association studies. Traditional models rely on genetic variants in close genomic proximity to the gene of interest to predict the genetic component of gene expression. Here, we propose a novel approach incorporating distal genetic variants acting through gene regulatory networks, in line with the omnigenic model of complex traits.
Results: Using causal and coexpression Bayesian networks reconstructed from genomic and transcriptomic data, inference of gene expression from genotypic data is achieved through a two-step process. Initially, the expression level of each gene is predicted using its local genetic variants. The residual differences between the observed and predicted expression levels are then modeled using the genotype information of parent and/or grandparent nodes in the network. The final predicted expression level is obtained by summing the predictions from both models, effectively incorporating both local and distal genetic influences. Using regularized regression techniques for parameter estimation, we found that gene regulatory network-based gene expression prediction outperformed the traditional approach on simulated data and real data from yeast and humans. This study provides important insights into the challenge of gene expression prediction for transcriptome-wide association studies.
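The two-step scheme described above can be illustrated with ordinary least squares on simulated data. This is a hedged sketch, not the GRN-TI code: step 1 regresses a gene's expression on its local (cis) variants, step 2 regresses the residuals on the genotypes of the gene's network parents, and the final prediction sums both components.

```python
import numpy as np

# All data are simulated for illustration; genotypes are coded 0/1/2.
rng = np.random.default_rng(0)
n = 200
X_cis = rng.integers(0, 3, size=(n, 5)).astype(float)      # local genotypes
X_parents = rng.integers(0, 3, size=(n, 3)).astype(float)  # parent-node genotypes
y = (X_cis @ np.array([0.5, -0.3, 0.0, 0.2, 0.1])
     + X_parents @ np.array([0.4, 0.0, -0.2])
     + rng.normal(0, 0.1, n))

# Step 1: predict expression from local variants only.
b_cis, *_ = np.linalg.lstsq(X_cis, y, rcond=None)
y_local = X_cis @ b_cis

# Step 2: model the residual with the parents' genotypes.
resid = y - y_local
b_par, *_ = np.linalg.lstsq(X_parents, resid, rcond=None)

# Final prediction combines local and distal genetic components.
y_hat = y_local + X_parents @ b_par
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

The paper uses regularized regression (e.g. lasso/ridge) rather than plain least squares; the additive local-plus-residual structure is the point of the sketch.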
Availability and implementation: The code is available on GitHub at github.com/guutama/GRN-TI.
"Predicting the genetic component of gene expression using gene regulatory networks." Gutama Ibrahim Mohammad, Tom Michoel. Bioinformatics Advances 4(1): vbae180 (2024). doi: 10.1093/bioadv/vbae180.
Pub Date: 2024-11-22 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbae182
Nure Tasnina, T M Murali
Motivation: Molecular interaction networks are powerful tools for studying cellular functions. Integrating diverse types of networks enhances performance in downstream tasks such as gene module detection and protein function prediction. The challenge lies in extracting meaningful protein feature representations due to varying levels of sparsity and noise across these heterogeneous networks.
Results: We propose ICoN, a novel unsupervised graph neural network model that takes multiple protein-protein association networks as inputs and generates, for each protein, a feature representation that integrates the topological information from all the networks. A key contribution of ICoN is a "co-attention" mechanism that enables cross-network communication during training. The model also incorporates a denoising training technique, introducing perturbations to each input network and training the model to reconstruct the original network from its corrupted version. Our experimental results demonstrate that ICoN surpasses individual networks across three downstream tasks: gene module detection, gene coannotation prediction, and protein function prediction. Compared to existing unsupervised network integration models, ICoN exhibits superior performance across the majority of downstream tasks and shows enhanced robustness against noise. This work introduces a promising approach for effectively integrating diverse protein-protein association networks, aiming to achieve a biologically meaningful representation of proteins.
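A minimal, hypothetical illustration of cross-network co-attention (not the ICoN implementation): each network provides one embedding per protein, and a protein's integrated representation is an attention-weighted mixture of its embeddings across networks, so information flows between networks when the views are combined.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(embeddings):
    """embeddings: array of shape (k, n, d) = k networks, n proteins, d dims.
    Returns an (n, d) array of per-protein embeddings fused across networks."""
    k, n, d = embeddings.shape
    views = embeddings.transpose(1, 0, 2)                    # (n, k, d)
    # scores[i, a, b]: similarity between network a's and network b's
    # view of protein i (scaled dot product).
    scores = views @ views.transpose(0, 2, 1) / np.sqrt(d)   # (n, k, k)
    weights = softmax(scores, axis=-1)                       # attend across networks
    attended = weights @ views                               # (n, k, d)
    return attended.mean(axis=1)                             # fuse the k views

rng = np.random.default_rng(1)
emb = rng.normal(size=(3, 10, 8))   # 3 networks, 10 proteins, 8-dim embeddings
fused = co_attention(emb)
```

In ICoN this kind of cross-network attention is learned inside a graph neural network and combined with denoising reconstruction; the sketch only shows the attention-based fusion step.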
Availability and implementation: The ICoN software is available under the GNU Public License v3 at https://github.com/Murali-group/ICoN.
"ICoN: integration using co-attention across biological networks." Nure Tasnina, T M Murali. Bioinformatics Advances 5(1): vbae182 (2024). doi: 10.1093/bioadv/vbae182.