Pub Date: 2024-10-28; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae168
Maura John, Arthur Korte, Marco Todesco, Dominik G Grimm
Motivation: Permutation-based significance thresholds have been shown to be a robust alternative to classical Bonferroni significance thresholds in genome-wide association studies (GWAS) for skewed phenotype distributions. The recently published method permGWAS introduced a batch-wise approach to efficiently compute permutation-based GWAS. However, running multiple univariate tests in parallel leads to many repetitive computations and an increased demand for computational resources. More importantly, traditional permutation methods that permute only the phenotype break the underlying population structure.
Results: We propose permGWAS2, an improved method that does not break the population structure during permutations and uses an elegant block matrix decomposition to optimize computations, thereby reducing redundancies. We show on synthetic data that this improved approach yields a lower false discovery rate for skewed phenotype distributions compared to the previous version and the commonly used Bonferroni correction. In addition, we re-analyze a dataset covering phenotypic variation in 86 traits in a population of 615 wild sunflowers (Helianthus annuus L.). This led to the identification of dozens of novel associations with putatively adaptive traits, and removed several likely false-positive associations with limited biological support.
Availability and implementation: permGWAS2 is open-source and publicly available on GitHub for download: https://github.com/grimmlab/permGWAS.
Title: Population-aware permutation-based significance thresholds for genome-wide association studies. Journal: Bioinformatics advances, volume 4, issue 1, vbae168. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11639184/pdf/
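The idea of a permutation-based significance threshold from the permGWAS2 abstract can be illustrated with a minimal sketch. It shows the classical scheme that permutes only the phenotype, which the abstract notes breaks population structure; it is not the population-aware procedure of permGWAS2, and it uses a plain correlation statistic rather than a mixed model. The function names, the simulated genotypes, and the exponential phenotype are all assumptions made for illustration.

```python
import numpy as np

def max_abs_corr(X, y):
    """Largest absolute Pearson correlation between y and any column of X."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return np.abs((Xc.T @ yc) / denom).max()

def permutation_threshold(X, y, n_perm=200, alpha=0.05, seed=0):
    """Genome-wide threshold: the (1 - alpha) quantile of the maximum
    statistic observed across phenotype permutations."""
    rng = np.random.default_rng(seed)
    null_max = np.array(
        [max_abs_corr(X, rng.permutation(y)) for _ in range(n_perm)]
    )
    return np.quantile(null_max, 1 - alpha)

# Simulated data (hypothetical): 200 samples, 500 SNPs coded 0/1/2,
# and a skewed phenotype with no true associations.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 500)).astype(float)
y = rng.exponential(size=200)

threshold = permutation_threshold(X, y)
print(f"|r| threshold at alpha = 0.05: {threshold:.3f}")
```

A marker would be called significant in this sketch only if its absolute correlation with the observed phenotype exceeds `threshold`. Because the threshold is derived from the data's own null distribution, it adapts to skewed phenotypes in a way that a fixed Bonferroni cut-off does not.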
Pub Date: 2024-10-23; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae162
Simon G Coetzee, Dennis J Hazelett
Motivation: motifbreakR scans genetic variants against position weight matrices of transcription factors (TFs) to determine the potential for the disruption of binding at the site of the variant. It leverages the Bioconductor suite of software packages and annotations to query a diverse array of genomes and motif databases. motifbreakR was initially developed to interrogate the effect of single-nucleotide variants on TF binding sites; in motifbreakR v2, we have expanded its functionality.
Results: New features include the ability to query other types of complex genetic variants, such as short insertions and deletions. This capability allows modeling a more extensive array of variants that may have significant effects on TF binding. Additionally, predictions based on sequence preference alone can indicate many more potential binding events than are observed. Adding information from DNA-binding sequencing datasets lends confidence to motif disruption prediction by demonstrating TF binding in cell lines and tissue types. Therefore, motifbreakR can directly query the ReMap2022 database for evidence that a TF matching the disrupted motif binds over the disrupting variant. Finally, in addition to the existing interface, we implemented an R/Shiny graphical user interface to simplify and enhance access for researchers with different skill sets.
Availability and implementation: motifbreakR is implemented in R. Source code, documentation, and tutorials are available on Bioconductor at https://bioconductor.org/packages/release/bioc/html/motifbreakR.html and GitHub at https://github.com/Simon-Coetzee/motifBreakR.
Title: motifbreakR v2: expanded variant analysis including indels and integrated evidence from transcription factor binding databases. Journal: Bioinformatics advances, volume 4, issue 1, vbae162. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11520234/pdf/
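The general idea of scoring a variant against a position weight matrix, as described in the motifbreakR abstract, can be sketched as follows. This is not motifbreakR's API or scoring method (the package is written in R); it is a simplified log-odds illustration. The 4-position matrix, the uniform background, the sequences, and the function names are all hypothetical, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

# Hypothetical position probability matrix for a 4 bp motif.
# Rows are motif positions; columns are A, C, G, T.
PPM = [
    [0.80, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.70],
]
BACKGROUND = 0.25  # assumed uniform base frequency

def pwm_score(window):
    """Log-odds score (in bits) of one motif-length window."""
    return sum(
        math.log2(PPM[i][BASES.index(base)] / BACKGROUND)
        for i, base in enumerate(window)
    )

def best_window_score(seq):
    """Best score over all motif-length windows of seq."""
    width = len(PPM)
    return max(pwm_score(seq[i:i + width]) for i in range(len(seq) - width + 1))

ref = "TTAGCTAA"  # reference allele context
alt = "TTAGATAA"  # C>A substitution at index 4

delta = best_window_score(ref) - best_window_score(alt)
print(f"ref = {best_window_score(ref):.2f} bits, "
      f"alt = {best_window_score(alt):.2f} bits, loss = {delta:.2f} bits")
```

Because the best window is recomputed separately for each allele, the same comparison also works when the two sequences differ in length, which is one simple way to think about scoring short insertions and deletions.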
Pub Date: 2024-10-22; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae152
Mariia Zelenskaia, Yazhini Arangasamy, Milot Mirdita, Johannes Söding, Venket Raghavan
Summary: The annotation of deeply sequenced, de novo assembled transcriptomes continues to be a challenge as some of the state-of-the-art tools are slow, difficult to install, and hard to use. We have tackled these issues with TransAnnot, a fast, automated transcriptome annotation pipeline that is easy to install and use. Leveraging the fast sequence searches provided by the MMseqs2 suite, TransAnnot offers one-step annotation of homologs from Swiss-Prot, gene ontology terms and orthogroups from eggNOG, and functional domains from Pfam. Users also have the option to annotate against custom databases. TransAnnot accepts sequencing reads (short and long), nucleotide sequences, or amino acid sequences as input for annotation. When benchmarked with test data sets of amino acid sequences, TransAnnot was 333, 284, and 18 times faster than the comparable tools EnTAP, Trinotate, and eggNOG-mapper, respectively.
Availability and implementation: TransAnnot is free to use, open sourced under GPLv3, and is implemented in C++ and Bash. Source code, documentation, and pre-compiled binaries are available at https://github.com/soedinglab/transannot. TransAnnot is also available via bioconda (https://anaconda.org/bioconda/transannot).
Title: TransAnnot-a fast transcriptome annotation pipeline. Journal: Bioinformatics advances, volume 4, issue 1, vbae152. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11530227/pdf/
Pub Date: 2024-10-22; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae161
Matthew Crown, Matthew Bashton
Motivation: Mappings of domain-cognate ligand interactions can enhance our understanding of the core concepts of evolution and be used to aid docking and protein design. Since the last available cognate-ligand domain database was released, the PDB has grown significantly and new tools are available for measuring similarity and determining contacts.
Results: We present ProCogGraph, a graph database of cognate-ligand domain mappings in PDB structures. Building upon the work of the predecessor database, PROCOGNATE, we use data-driven approaches to develop thresholds and interaction modes. We explore new aspects of domain-cognate ligand interactions, including the chemical similarity of bound cognate ligands and how domain combinations influence cognate ligand binding. Finally, we use the graph to add specificity to partial EC IDs, showing that ProCogGraph can complete partial annotations systematically through assigned cognate ligands.
Availability and implementation: The ProCogGraph pipeline, database and flat files are available at https://github.com/bashton-lab/ProCogGraph and https://doi.org/10.5281/zenodo.13165851.
Title: ProCogGraph: a graph-based mapping of cognate ligand domain interactions. Journal: Bioinformatics advances, volume 4, issue 1, vbae161. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11561043/pdf/
Pub Date: 2024-10-21; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae159
Mark Ziemann, Barry Schroeter, Anusuiya Bora
Motivation: Overrepresentation analysis (ORA) is used widely to assess the enrichment of functional categories in a gene list compared to a background list. ORA is therefore a critical method in the interpretation of 'omics data, relating gene lists to biological functions and themes. Although ORA is hugely popular, we and others have noticed two potentially undesired behaviours of some ORA tools. The first one we call the 'background problem', because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. The second one we call the 'false discovery rate problem', because some tools underestimate the true number of parallel tests conducted.
Results: Here, we demonstrate the impact of these issues on several real RNA-seq datasets and use simulated RNA-seq data to quantify the impact of these problems. We show that the severity of these problems depends on the gene set library, the number of genes in the list, and the degree of noise in the dataset. These problems can be mitigated by changing packages/websites for ORA or by changing to another approach such as functional class scoring.
Availability and implementation: An R/Shiny tool has been provided at https://oratool.ziemann-lab.net/ and the supporting materials are available from Zenodo (https://zenodo.org/records/13823301).
Title: Two subtle problems with overrepresentation analysis. Journal: Bioinformatics advances, volume 4, issue 1, vbae159. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11557902/pdf/
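The 'background problem' named in the overrepresentation analysis abstract can be made concrete with a small sketch of the hypergeometric test that ORA tools commonly use. The counts below are invented for illustration and are not taken from the paper; the function name is an assumption, and real tools differ in how they define the background.

```python
from math import comb

def hypergeom_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which belong to the functional category."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)

# Hypothetical counts, chosen only for illustration.
set_size = 100  # genes annotated to the category
overlap = 10    # genes in the list that belong to the category

# (a) Full background: every detected gene is kept.
p_full = hypergeom_tail(overlap, N=20000, K=set_size, n=500)

# (b) Reduced background: genes without any annotation are silently
#     dropped from both the background and the gene list.
p_reduced = hypergeom_tail(overlap, N=10000, K=set_size, n=300)

print(f"full background:    p = {p_full:.2e}")
print(f"reduced background: p = {p_reduced:.2e}")
```

With these made-up counts the same overlap yields two different p-values depending only on which genes the software keeps in the background, which is why it is worth checking how a given tool handles unannotated genes.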
Pub Date: 2024-10-18; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae156
Na Zhao, David L Bennett, Georgios Baskozos, Allison M Barry
Motivation: Accurate identification of pain-related genes remains challenging due to the complex nature of pain pathophysiology and the subjective nature of pain reporting in humans. Here, we use machine learning to identify possible 'pain genes'. Labels were based on a gold-standard list of genes with validated involvement across pain conditions, and models were trained on a selection of -omics, protein-protein interaction network features, and biological function readouts for each gene.
Results: The top-performing model was selected to predict a 'pain score' per gene. The top-ranked genes were then validated against pain-related human SNPs. Functional analysis revealed the JAK2/STAT3, ErbB, and Rap1 signalling pathways as promising targets for further exploration, while network topological features contributed significantly to the identification of 'pain' genes. As such, a network based on top-ranked genes was constructed to reveal previously uncharacterized pain-related genes. Together, these novel insights into pain pathogenesis can indicate promising directions for future experimental research.
Availability and implementation: These analyses can be further explored using the linked open-source database at https://livedataoxford.shinyapps.io/drg-directory/, which is accompanied by a freely accessible code template and user guide for wider adoption across disciplines.
Title: Predicting 'pain genes': multi-modal data integration using probabilistic classifiers and interaction networks. Journal: Bioinformatics advances, volume 4, issue 1, vbae156. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549022/pdf/
Pub Date: 2024-10-14; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae154
Dea Gogishvili, Emmanuel Minois-Genin, Jan van Eck, Sanne Abeln
Motivation: Hydrophobic patches on protein surfaces play important functional roles in protein-protein and protein-ligand interactions. Large hydrophobic surfaces are also involved in the progression of aggregation diseases. Predicting exposed hydrophobic patches from a protein sequence has proven to be a difficult task. Fine-tuning foundation models allows for adapting a model to the specific nuances of a new task using a much smaller dataset. Additionally, multitask deep learning offers a promising solution for addressing data gaps, simultaneously outperforming single-task methods.
Results: In this study, we harnessed a recently released leading large language model, Evolutionary Scale Modeling (ESM-2). Efficient fine-tuning of ESM-2 was achieved by leveraging a recently developed parameter-efficient fine-tuning method. This approach enabled comprehensive training of model layers without excessive parameters and without the need to include a computationally expensive multiple sequence analysis. We explored several related tasks, at local (residue) and global (protein) levels, to improve the representation of the model. As a result, our model, PatchProt, not only predicts hydrophobic patch areas but also outperforms existing methods at predicting primary tasks, including secondary structure and surface accessibility predictions. Importantly, our analysis shows that including related local tasks can improve predictions on more difficult global tasks. This research sets a new standard for sequence-based protein property prediction and highlights the remarkable potential of fine-tuning foundation models, enriching the model representation by training over related tasks.
Availability and implementation: https://github.com/Deagogishvili/chapter-multi-task.
Title: PatchProt: hydrophobic patch prediction using protein foundation models. Journal: Bioinformatics advances, volume 4, issue 1, vbae154. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11525051/pdf/
Pub Date: 2024-10-11; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae153
Greta Bellinzona, Davide Sassera, Alexandre M J J Bonvin
Motivation: Discovering new protein-protein interactions (PPIs) across entire proteomes offers vast potential for understanding novel protein functions and elucidating system properties within or between organisms. While recent advances in computational structural biology, particularly AlphaFold-Multimer, have facilitated this task, scaling to large-scale screenings remains a challenge, requiring significant computational resources.
Results: We evaluated the impact of reducing the number of models generated by AlphaFold-Multimer from five to one on the method's ability to distinguish true PPIs from false ones. Our evaluation was conducted on a dataset containing both intra- and inter-species PPIs, which included proteins from bacterial and eukaryotic sources. We demonstrate that reducing the sampling does not compromise the accuracy of the method, offering a faster, more efficient, and more environmentally friendly solution for PPI predictions.
Availability and implementation: The code used in this article is available at https://github.com/MIDIfactory/AlphaFastPPi. Note that the same can be achieved using the latest version of AlphaPulldown available at https://github.com/KosinskiLab/AlphaPulldown.
Title: Accelerating protein-protein interaction screens with reduced AlphaFold-Multimer sampling. Journal: Bioinformatics advances, volume 4, issue 1, vbae153. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11513016/pdf/
Pub Date: 2024-10-09; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae149
Daniel R Olson, Travis J Wheeler
In the age of long-read sequencing, genomics researchers now have access to accurate repetitive DNA sequence (including satellites) that, due to the limitations of short-read sequencing, could previously be observed only as unmappable fragments. Tools that annotate repetitive sequence are now more important than ever, both to better understand newly uncovered repetitive sequences and to mitigate the errors those sequences cause in bioinformatic software. To that end, we introduce the 1.0 release of our tool for identifying and annotating locally repetitive sequence, ULTRA Locates Tandemly Repetitive Areas (ULTRA). ULTRA is fast enough to use as part of an efficient annotation pipeline, produces reliable, state-of-the-art coverage of repetitive regions containing many mutations, and provides interpretable statistics and labels for repetitive regions.
Availability and implementation: ULTRA is released under an open-source license and is available for download at https://github.com/TravisWheelerLab/ULTRA.
{"title":"ULTRA-effective labeling of tandem repeats in genomic sequence.","authors":"Daniel R Olson, Travis J Wheeler","doi":"10.1093/bioadv/vbae149","DOIUrl":"10.1093/bioadv/vbae149","url":null,"PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae149"},"PeriodicalIF":2.4,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11580682/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142689857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-08; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae151
Heming Zhang, Dekang Cao, Zirui Chen, Xiuyuan Zhang, Yixin Chen, Cole Sessions, Carlos Cruchaga, Philip Payne, Guangfu Li, Michael Province, Fuhai Li
Motivation: Multi-omics data (i.e. genomics, epigenomics, transcriptomics, and proteomics) characterize complex cellular signaling systems at multiple levels and from multiple views, providing a holistic picture of cellular signaling pathways. However, it remains challenging to integrate and interpret multi-omics data for mining critical biomarkers. Graph AI models have been widely used to analyze graph-structured datasets and are well suited to integrative multi-omics data analysis because they can naturally represent multi-omics data as a biologically meaningful multi-level signaling graph and interpret the data via graph node and edge ranking analysis. Nevertheless, it is nontrivial for graph-AI model developers to pre-analyze multi-omics data and convert the data into biologically meaningful graphs that can be fed directly into graph-AI models.
Results: To resolve this challenge, we developed mosGraphGen (multi-omics signaling graph generator), which generates multi-omics signaling graphs (mos-graphs) of individual samples by mapping multi-omics data onto a biologically meaningful multi-level background signaling network; the data are normalized by aggregating measurements and aligning them to the reference genome. With mosGraphGen, AI model developers can directly apply and evaluate their models using these mos-graphs. We illustrate mosGraphGen on two widely used multi-omics datasets, from The Cancer Genome Atlas (TCGA) and from Alzheimer's disease (AD) samples.
Availability and implementation: The code of mosGraphGen is open-source and publicly available via GitHub: https://github.com/FuhaiLiAiLab/mosGraphGen.
{"title":"mosGraphGen: a novel tool to generate multi-omics signaling graphs to facilitate integrative and interpretable graph AI model development.","authors":"Heming Zhang, Dekang Cao, Zirui Chen, Xiuyuan Zhang, Yixin Chen, Cole Sessions, Carlos Cruchaga, Philip Payne, Guangfu Li, Michael Province, Fuhai Li","doi":"10.1093/bioadv/vbae151","DOIUrl":"10.1093/bioadv/vbae151","url":null,"PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae151"},"PeriodicalIF":2.4,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11540438/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142592400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}