Motivation: The human leukocyte antigen (HLA) system is the main cause of organ transplant loss: HLAs present on the graft are recognized by donor-specific antibodies raised by the recipient. It is therefore of key importance to identify all potentially immunogenic B-cell epitopes on HLAs in order to refine organ allocation. Such HLA epitopes are currently characterized by the presence of polymorphic residues called "eplets". However, many polymorphic positions in HLA sequences have not yet been experimentally confirmed as eplets associated with an HLA epitope. Moreover, structural studies of these epitopes have so far considered only static 3D structures.
Results: We present here a machine-learning approach for predicting HLA epitopes, based on 3D-surface patches and molecular dynamics simulations. A collection of 3D-surface patches labeled as Epitope (2117) or Nonepitope (4769) according to Human Leukocyte Antigen Eplet Registry information was derived from 207 HLAs (61 solved and 146 predicted structures). Descriptors derived from static and dynamic patch properties were computed, and three tree-based models were trained on a reduced non-redundant dataset. HLA-EpiCheck is the prediction system formed by the three models; dynamic descriptors of the 3D-surface patches account for more than half of its prediction performance. Epitope predictions on unconfirmed eplets (absent from the initial dataset) are compared with experimental results, and notable consistency is found.
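As a rough illustration of the modeling setup (not the authors' code; descriptor values, dataset sizes, and labels below are synthetic), training a tree-based classifier on fixed-length 3D-surface patch descriptor vectors with an Epitope/Nonepitope label might look like:

```python
# Illustrative sketch: tree-based classification of labeled patch descriptors.
# All data here are synthetic stand-ins for the static/dynamic patch descriptors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
n_patches, n_descriptors = 600, 20            # hypothetical sizes
X = rng.normal(size=(n_patches, n_descriptors))
# Fabricated label rule so the toy data carry learnable signal:
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n_patches)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(balanced_accuracy_score(y_te, clf.predict(X_te)))
```

Balanced accuracy is a reasonable choice here because, as in the real dataset, the two classes are imbalanced.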
Availability and implementation: Structural data and MD trajectories are deposited as open data under doi: 10.57745/GXZHH8. In-house scripts and machine-learning models for HLA-EpiCheck are available from https://gitlab.inria.fr/capsid.public_codes/hla-epicheck.
"HLA-EpiCheck: novel approach for HLA B-cell epitope prediction using 3D-surface patch descriptors derived from molecular dynamic simulations." Diego Amaya-Ramirez, Magali Devriese, Romain Lhotte, Cédric Usureau, Malika Smaïl-Tabbone, Jean-Luc Taupin, Marie-Dominique Devignes. Bioinformatics Advances 4(1):vbae186. Pub Date: 2024-12-05. DOI: 10.1093/bioadv/vbae186.
Pub Date: 2024-12-04 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae170
Perry T Wasdin, Alexandra A Abu-Shmais, Michael W Irvin, Matthew J Vukovich, Ivelin S Georgiev
Motivation: LIBRA-seq (linking B cell receptor to antigen specificity by sequencing) provides a powerful tool for interrogating the antigen-specific B cell compartment and identifying antibodies against antigen targets of interest. Identification of noise in single-cell B cell receptor sequencing data, such as LIBRA-seq, is critical for improving antigen binding predictions for downstream applications including antibody discovery and machine learning technologies.
Results: In this study, we present a method for denoising LIBRA-seq data by clustering antigen counts into signal and noise components with a negative binomial mixture model. This approach leverages single-cell sequencing reads from a large, multi-donor dataset described in a recent LIBRA-seq study to develop a data-driven means for identification of technical noise. We apply this method to nine donors representing separate LIBRA-seq experiments and show that our approach provides improved predictions for in vitro antibody-antigen binding compared with the standard scoring method, despite variance in data size and noise structure across samples. This development will improve the ability of LIBRA-seq to identify antigen-specific B cells and contribute to providing more reliable datasets for machine learning-based approaches as the corpus of single-cell B cell sequencing data continues to grow.
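The core idea can be sketched in a simplified form (not the authors' implementation): each antigen count is scored by its posterior probability of coming from a "signal" negative binomial component rather than a "noise" component. The mixture weights and component parameters below are fixed for illustration; in practice they would be fitted per sample.

```python
# Simplified sketch of signal/noise assignment under a two-component
# negative binomial mixture. All parameters are assumed, not fitted.
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(1)
noise = nbinom.rvs(n=2, p=0.6, size=900, random_state=rng)    # low counts
signal = nbinom.rvs(n=10, p=0.1, size=100, random_state=rng)  # high counts
counts = np.concatenate([noise, signal])

w_noise, w_signal = 0.9, 0.1                   # assumed mixture weights
lik_noise = nbinom.pmf(counts, n=2, p=0.6)
lik_signal = nbinom.pmf(counts, n=10, p=0.1)
post_signal = (w_signal * lik_signal) / (w_noise * lik_noise
                                         + w_signal * lik_signal)
is_signal = post_signal > 0.5                  # posterior assignment
print(is_signal.sum())
```

With well-separated components, thresholding the posterior at 0.5 recovers essentially the high-count cells as signal.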
Availability and implementation: All data and code are available at https://github.com/IGlab-VUMC/mixture_model_denoising.
"Negative binomial mixture model for identification of noise in antibody-antigen specificity predictions from single-cell data." Bioinformatics Advances 4(1):vbae170.
Pub Date: 2024-12-02 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae193
Arjun Srivatsa, Russell Schwartz
Motivation: Genomic biotechnology has rapidly advanced, allowing for the inference and modification of genetic and epigenetic information at the single-cell level. While these tools hold enormous potential for basic and clinical research, they also raise difficult issues of how to design studies to deploy them most effectively. In designing a genomic study, a modern researcher might combine many sequencing modalities and sampling protocols, each with different utility, costs, and other tradeoffs. This is especially relevant for studies of somatic variation, which may involve highly heterogeneous cell populations whose differences can be probed via an extensive set of biotechnological tools. Efficiently deploying genomic technologies in this space will require principled ways to create study designs that recover desired genomic information while minimizing various measures of cost.
Results: The central problem this paper addresses is how to create an optimal study design for a genomic analysis, with particular focus on studies of somatic variation, which arise most often in cancer genomics. We pose the study design problem as a stochastic constrained nonlinear optimization problem. We introduce a Bayesian optimization framework that iteratively optimizes an objective function using surrogate modeling combined with pattern and gradient search. We demonstrate our procedure on several test cases, deriving resource and study-design allocations optimized for various goals and criteria and showing that it can optimize study designs efficiently across diverse scenarios.
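The generic loop such frameworks build on (surrogate model plus acquisition rule) can be sketched as follows. This is a minimal illustration with a toy 1D objective, a Gaussian-process surrogate, and an upper-confidence-bound acquisition, not the paper's framework, which additionally handles constraints and uses pattern and gradient search.

```python
# Minimal Bayesian-optimization sketch: fit a GP surrogate to past
# evaluations, then evaluate the point maximizing an acquisition function.
import numpy as np

def objective(x):                       # toy stand-in for a study-design score
    return -(x - 0.3) ** 2

def gp_posterior(X, y, Xs, length=0.2, noise=1e-4):
    """GP posterior mean/variance on grid Xs with an RBF kernel."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K_inv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
    Ks = k(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)
    return mu, np.maximum(var, 0.0)

grid = np.linspace(0.0, 1.0, 201)
X = np.array([0.0, 1.0])                # initial design points
y = objective(X)
for _ in range(15):
    mu, var = gp_posterior(X, y, grid)
    ucb = mu + 2.0 * np.sqrt(var)       # acquisition: exploit + explore
    x_next = grid[np.argmax(ucb)]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))
best = float(X[np.argmax(y)])
print(round(best, 2))
```

The loop alternates between sampling where the surrogate is uncertain and refining near the current optimum, so the best sampled point converges toward the true maximizer.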
Availability and implementation: https://github.com/CMUSchwartzLab/StudyDesignOptimization.
"Optimizing design of genomics studies for clonal evolution analysis." Bioinformatics Advances 4(1):vbae193.
Pub Date: 2024-11-29 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae190
My-Diem Nguyen Pham, Chinh Tran-To Su, Thanh-Nhan Nguyen, Hoai-Nghia Nguyen, Dinh Duy An Nguyen, Hoa Giang, Dinh-Thuc Nguyen, Minh-Duy Phan, Vy Nguyen
Motivation: Predicting T-cell receptor (TCR)-antigen binding is crucial for advancements in immunotherapy. However, most current TCR-peptide interaction predictors struggle to perform well on unseen data. This limitation may stem from the conventional use of TCR and/or peptide sequences as input, which may not adequately capture their structural characteristics. Therefore, incorporating the structural information of TCRs and peptides into the prediction model is necessary to improve its generalizability.
Results: We developed epiTCR-KDA (KDA stands for Knowledge Distillation model on Dihedral Angles), a new predictor of TCR-peptide binding that utilizes the dihedral angles between the residues of the peptide and the TCR as a structural descriptor. This structural information was integrated into a knowledge distillation model to enhance its generalizability. epiTCR-KDA demonstrated competitive prediction performance, with an area under the curve (AUC) of 1.00 for seen data and an AUC of 0.91 for unseen data. On public datasets, epiTCR-KDA consistently outperformed other predictors, maintaining a median AUC of 0.93. Further analysis of epiTCR-KDA revealed that the cosine similarity of the dihedral angle vectors between the unseen testing data and training data is crucial for its stable performance. In conclusion, our epiTCR-KDA model represents a significant step forward in developing a highly effective pipeline for antigen-based immunotherapy.
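The two geometric ingredients mentioned above can be sketched independently of the epiTCR-KDA pipeline: computing a dihedral angle from four atom positions, and the cosine similarity between two dihedral-angle vectors. The coordinates below are fabricated for illustration.

```python
# Standard dihedral-angle computation (four points -> signed angle) and
# cosine similarity between descriptor vectors. Toy coordinates only.
import numpy as np

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b0, b1, b2 = p0 - p1, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1         # components orthogonal to the bond
    w = b2 - np.dot(b2, b1) * b1
    return np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A planar zig-zag chain is a trans conformation (dihedral of +/-180 degrees):
pts = [np.array(p, dtype=float)
       for p in [(0, 0, 0), (1, 1, 0), (2, 0, 0), (3, 1, 0)]]
angle = dihedral_deg(*pts)
print(round(abs(float(angle)), 1))
```

A per-structure descriptor vector would then simply stack such angles over residue quadruples, and the similarity between two structures is the cosine of the angle between their vectors.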
Availability and implementation: epiTCR-KDA is available on GitHub (https://github.com/ddiem-ri-4D/epiTCR-KDA).
"epiTCR-KDA: knowledge distillation model on dihedral angles for TCR-peptide prediction." Bioinformatics Advances 4(1):vbae190.
Pub Date: 2024-11-28 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae191
Matko Glunčić, Domjan Barić, Vladimir Paar
Motivation: Tandem monomeric units, integral components of eukaryotic genomes, form higher-order repeat (HOR) structures that play crucial roles in maintaining chromosome integrity and regulating gene expression and protein abundance. Given their significant influence on processes such as evolution, chromosome segregation, and disease, developing a sensitive and automated tool for identifying HORs across diverse genomic sequences is essential.
Results: In this study, we applied the GRMhor (Global Repeat Map hor) algorithm to analyse the centromeric region of chromosome 20 in three individual human genomes, as well as in the centromeric regions of three higher primates. In all three human genomes, we identified six distinct HOR arrays, which revealed significantly greater differences in the number of canonical and variant copies, as well as in their overall structure, than would be expected given the 99.9% genetic similarity among humans. Furthermore, our analysis of higher primate genomes, which revealed entirely different HOR sequences, indicates a much larger genomic divergence between humans and higher primates than previously recognized. These results underscore the suitability of the GRMhor algorithm for studying specificities in individual genomes, particularly those involving repetitive monomers in centromere structure, which is essential for proper chromosome segregation during cell division, while also highlighting its utility in exploring centromere evolution and other repetitive genomic regions.
Availability and implementation: Source code and example binaries freely available for download at github.com/gluncic/GRM2023.
"Efficient genome monomer higher-order structure annotation and identification using the GRMhor algorithm." Bioinformatics Advances 4(1):vbae191.
Pub Date: 2024-11-27 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbae192
Mohammed Zniber, Youssef Fatihi, Tan-Phat Huynh
Motivation: NMR-based metabolomics is a field driven by technological advancements, necessitating the use of advanced preprocessing tools. Despite this need, there is a remarkable scarcity of comprehensive and user-friendly preprocessing tools in Python. To bridge this gap, we have developed Protomix, a Python package designed for metabolomics research. Protomix offers a set of automated, efficient, and user-friendly signal-preprocessing steps, tailored to streamline and enhance the preprocessing phase in metabolomics studies.
Results: This package presents a comprehensive preprocessing pipeline compatible with various data analysis tools. It encompasses a suite of functionalities for data extraction, preprocessing, and interactive visualization. Additionally, it includes a tutorial in the form of a Python Jupyter notebook, specifically designed for the analysis of 1D 1H-NMR metabolomics data related to prostate cancer and benign prostatic hyperplasia.
Availability and implementation: Protomix can be accessed at https://github.com/mzniber/protomix and https://protomix.readthedocs.io/en/latest/index.html.
"Protomix: a Python package for ¹H-NMR metabolomics data preprocessing." Bioinformatics Advances 5(1):vbae192.
Motivation: Gene trees often differ from the species trees that contain them due to various factors, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Several highly accurate species tree estimation methods have been introduced to explicitly address ILS, including ASTRAL, a widely used statistically consistent method, and wQFM, a quartet amalgamation approach experimentally shown to be more accurate than ASTRAL. Two recent advancements, ASTRAL-Pro and DISCO, have emerged in phylogenomics to consider GDL. ASTRAL-Pro introduces a refined quartet similarity measure, accounting for orthology and paralogy. On the other hand, DISCO offers a general strategy to decompose multi-copy gene trees into a collection of single-copy trees, allowing the utilization of methods previously designed for species tree inference in the context of single-copy gene trees.
Results: In this study, we first introduce some variants of DISCO to examine its underlying hypotheses and present analytical results on the statistical guarantees of DISCO. In particular, we introduce DISCO-R, a variant of DISCO with a refined and improved pruning strategy that provides more accurate and robust results. We then demonstrate with extensive evaluation studies on a collection of simulated and real data sets that wQFM paired with DISCO variants consistently matches or outperforms ASTRAL-Pro and other competing methods.
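The decomposition idea can be illustrated with a toy sketch that is greatly simplified relative to DISCO (DISCO roots the gene tree and applies a careful pruning strategy, which DISCO-R further refines; here a tuple-encoded binary tree is simply split wherever a node's two subtrees share a species):

```python
# Toy multi-copy gene-tree decomposition: split at putative duplication
# nodes, i.e. nodes whose two subtrees contain overlapping species sets.
def leaves(t):
    """Set of species labels in a nested-tuple tree."""
    return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def decompose(t):
    """Return a list of single-copy subtrees of tuple-encoded tree t."""
    if isinstance(t, str):
        return [t]
    L, R = decompose(t[0]), decompose(t[1])
    if len(L) == 1 and len(R) == 1 and not (leaves(L[0]) & leaves(R[0])):
        return [(L[0], R[0])]            # speciation node: keep joined
    return L + R                         # duplication at/below: split

# ((A,B),(A,C)) contains species A twice, so it splits into two trees:
parts = decompose(((("A", "B"), ("A", "C")), "D"))
print(parts)
```

Each output tree contains at most one copy per species, so methods designed for single-copy gene trees (such as wQFM) can be run on the resulting collection.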
Availability and implementation: DISCO-R and other variants are freely available at https://github.com/skhakim/DISCO-variants.
"wQFM-DISCO: DISCO-enabled wQFM improves phylogenomic analyses despite the presence of paralogs." Sheikh Azizul Hakim, Md Rownok Zahan Ratul, Md Shamsuzzoha Bayzid. Bioinformatics Advances 4(1):vbae189. Pub Date: 2024-11-27. DOI: 10.1093/bioadv/vbae189.
Pub Date: 2024-11-25 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae188
Francesco Costa, Matthias Blum, Alex Bateman
Motivation: High confidence structure prediction models have become available for nearly all protein sequences. More than 200 million AlphaFold2 models are now publicly available. We observe that there can be significant variability in the prediction confidence as judged by plDDT scores across a protein family. We have explored whether the predictions with lower plDDT in a family can be improved by the use of higher plDDT templates from the family as template structures in AlphaFold2.
Results: Our work shows that structures with a low plDDT can be "rescued" (moved from low to reasonable confidence) about one-third of the time. We also find, surprisingly, that in many cases we get a higher-plDDT model when we switch off the multiple sequence alignment (MSA) option in AlphaFold2 and rely solely on a high-quality template. However, we find the best overall strategy is to make predictions both with and without the MSA information and select the model with the highest average plDDT. We also find that using high-plDDT models as templates can increase the speed of AlphaFold2 as implemented in ColabFold. Additionally, we provide evidence that as well as having increased overall plDDT, the models are likely to have higher-quality structures as judged by two metrics.
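The selection rule (keep the model with the highest average plDDT) can be sketched as below, using the usual AlphaFold2 convention of storing per-residue plDDT in the PDB B-factor column. The model names and ATOM records here are fabricated for illustration.

```python
# Pick the AlphaFold2 model with the highest mean plDDT, read from the
# B-factor column (columns 61-66) of PDB ATOM records. Toy records only.
def mean_plddt(pdb_lines):
    scores = [float(line[60:66]) for line in pdb_lines if line.startswith("ATOM")]
    return sum(scores) / len(scores)

def pick_best(models):
    """models: dict of name -> list of PDB lines; returns the best name."""
    return max(models, key=lambda name: mean_plddt(models[name]))

rec = ("ATOM      1  CA  ALA A   1      11.104   6.134  -6.504"
       "  1.00{:6.2f}           C")
models = {
    "with_msa": [rec.format(62.50)],   # hypothetical MSA-based prediction
    "no_msa": [rec.format(88.25)],     # hypothetical template-only prediction
}
print(pick_best(models))
```

In a real pipeline the two inputs would be the full PDB files produced by the with-MSA and without-MSA runs.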
Availability and implementation: We have implemented our pipeline in NextFlow and it is available in GitHub: https://github.com/FranceCosta/AF2Fix.
"Keeping it in the family: using protein family templates to rescue low confidence AlphaFold2 models." Bioinformatics Advances 4(1):vbae188.
Pub Date: 2024-11-23 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae180
Gutama Ibrahim Mohammad, Tom Michoel
Motivation: Gene expression prediction plays a vital role in transcriptome-wide association studies. Traditional models rely on genetic variants in close genomic proximity to the gene of interest to predict the genetic component of gene expression. Here, we propose a novel approach incorporating distal genetic variants acting through gene regulatory networks, in line with the omnigenic model of complex traits.
Results: Using causal and coexpression Bayesian networks reconstructed from genomic and transcriptomic data, inference of gene expression from genotypic data is achieved through a two-step process. Initially, the expression level of each gene is predicted using its local genetic variants. The residual differences between the observed and predicted expression levels are then modeled using the genotype information of parent and/or grandparent nodes in the network. The final predicted expression level is obtained by summing the predictions from both models, effectively incorporating both local and distal genetic influences. Using regularized regression techniques for parameter estimation, we found that gene regulatory network-based gene expression prediction outperformed the traditional approach on simulated data and real data from yeast and humans. This study provides important insights into the challenge of gene expression prediction for transcriptome-wide association studies.
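The two-step scheme described above can be illustrated with ordinary least squares on simulated data. This is a hedged sketch, not the GRN-TI code: step 1 regresses a gene's expression on its local (cis) variants, step 2 regresses the residuals on the genotypes of the gene's network parents, and the final prediction sums both components.

```python
import numpy as np

# All data are simulated for illustration; genotypes are coded 0/1/2.
rng = np.random.default_rng(0)
n = 200
X_cis = rng.integers(0, 3, size=(n, 5)).astype(float)      # local genotypes
X_parents = rng.integers(0, 3, size=(n, 3)).astype(float)  # parent-node genotypes
y = (X_cis @ np.array([0.5, -0.3, 0.0, 0.2, 0.1])
     + X_parents @ np.array([0.4, 0.0, -0.2])
     + rng.normal(0, 0.1, n))

# Step 1: predict expression from local variants only.
b_cis, *_ = np.linalg.lstsq(X_cis, y, rcond=None)
y_local = X_cis @ b_cis

# Step 2: model the residual with the parents' genotypes.
resid = y - y_local
b_par, *_ = np.linalg.lstsq(X_parents, resid, rcond=None)

# Final prediction combines local and distal genetic components.
y_hat = y_local + X_parents @ b_par
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

The paper uses regularized regression (e.g. lasso/ridge) rather than plain least squares; the additive local-plus-residual structure is the point of the sketch.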
Availability and implementation: The code is available on GitHub at github.com/guutama/GRN-TI.
"Predicting the genetic component of gene expression using gene regulatory networks." Gutama Ibrahim Mohammad, Tom Michoel. Bioinformatics Advances 4(1): vbae180 (2024). doi: 10.1093/bioadv/vbae180.
Pub Date: 2024-11-22 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbae182
Nure Tasnina, T M Murali
Motivation: Molecular interaction networks are powerful tools for studying cellular functions. Integrating diverse types of networks enhances performance in downstream tasks such as gene module detection and protein function prediction. The challenge lies in extracting meaningful protein feature representations due to varying levels of sparsity and noise across these heterogeneous networks.
Results: We propose ICoN, a novel unsupervised graph neural network model that takes multiple protein-protein association networks as inputs and generates, for each protein, a feature representation that integrates the topological information from all the networks. A key contribution of ICoN is a "co-attention" mechanism that enables cross-network communication during training. The model also incorporates a denoising training technique, introducing perturbations to each input network and training the model to reconstruct the original network from its corrupted version. Our experimental results demonstrate that ICoN surpasses individual networks across three downstream tasks: gene module detection, gene coannotation prediction, and protein function prediction. Compared to existing unsupervised network integration models, ICoN exhibits superior performance across the majority of downstream tasks and shows enhanced robustness against noise. This work introduces a promising approach for effectively integrating diverse protein-protein association networks, aiming to achieve a biologically meaningful representation of proteins.
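A minimal, hypothetical illustration of cross-network co-attention (not the ICoN implementation): each network provides one embedding per protein, and a protein's integrated representation is an attention-weighted mixture of its embeddings across networks, so information flows between networks when the views are combined.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(embeddings):
    """embeddings: array of shape (k, n, d) = k networks, n proteins, d dims.
    Returns an (n, d) array of per-protein embeddings fused across networks."""
    k, n, d = embeddings.shape
    views = embeddings.transpose(1, 0, 2)                    # (n, k, d)
    # scores[i, a, b]: similarity between network a's and network b's
    # view of protein i (scaled dot product).
    scores = views @ views.transpose(0, 2, 1) / np.sqrt(d)   # (n, k, k)
    weights = softmax(scores, axis=-1)                       # attend across networks
    attended = weights @ views                               # (n, k, d)
    return attended.mean(axis=1)                             # fuse the k views

rng = np.random.default_rng(1)
emb = rng.normal(size=(3, 10, 8))   # 3 networks, 10 proteins, 8-dim embeddings
fused = co_attention(emb)
```

In ICoN this kind of cross-network attention is learned inside a graph neural network and combined with denoising reconstruction; the sketch only shows the attention-based fusion step.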
Availability and implementation: The ICoN software is available under the GNU Public License v3 at https://github.com/Murali-group/ICoN.
"ICoN: integration using co-attention across biological networks." Nure Tasnina, T M Murali. Bioinformatics Advances 5(1): vbae182 (2024). doi: 10.1093/bioadv/vbae182.