Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae659
Oliver M Turnbull, Dino Oglic, Rebecca Croasdale-Wood, Charlotte M Deane
Summary: A key challenge in antibody drug discovery is designing novel sequences that are free from developability issues, such as aggregation, polyspecificity, poor expression, or low solubility. Here, we present p-IgGen, a protein language model for paired heavy-light chain antibody generation. The model generates diverse, antibody-like sequences with pairing properties found in natural antibodies. We also create a fine-tuned version of p-IgGen that biases the model to generate antibodies with 3D biophysical properties that fall within distributions seen in clinical-stage therapeutic antibodies.
Availability and implementation: The model and inference code are freely available at www.github.com/oxpig/p-IgGen. Cleaned training data are deposited at doi.org/10.5281/zenodo.13880874.
Title: p-IgGen: a paired antibody generative language model.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11576349/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae651
Juan C Muñoz-Sánchez, María J Olmo-Uceda, José-Ángel Oteo, Santiago F Elena
Motivation: Defective viral genomes (DVGs) are variants of the wild-type (wt) virus that cannot autonomously complete an infectious cycle. However, in the presence of their parental (helper) wt virus, DVGs can interfere with the replication, encapsidation, and spread of functional genomes, acting as a significant selective force in viral evolution. DVGs also affect the host's immune responses and are linked to chronic infections and milder symptoms. Thus, identifying and characterizing DVGs is crucial for understanding infection prognosis. Quantifying DVGs is challenging because they cannot sustain themselves, which makes it difficult to distinguish them from the helper virus, especially using high-throughput RNA sequencing. Accurate quantification is essential for understanding their highly dynamic interactions with the helper virus.
Results: We present a method to simultaneously estimate the abundances of DVGs and wt genomes within a sample by identifying genomic regions with significant deviations from the expected sequencing depth. Our approach involves reconstructing the depth profile through a linear system of equations, which provides an estimate of the number of wt and DVG genomes of each type. Until now, in silico methods have only estimated the DVG-to-wt ratio for localized genomic regions. This is the first method that simultaneously estimates the proportions of wt genomes and DVGs genome-wide from short-read RNA sequencing.
Availability and implementation: The Matlab code and the synthetic datasets are freely available at https://github.com/jmusan/wtDVGquantific.
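As a rough illustration of the linear-system idea in the Results above (this is not the authors' Matlab implementation), the sketch below assumes deletion-type DVGs whose deleted intervals are already known, so that the observed depth at each position is the sum of the copies of every genome type covering that position. The function name, the input format, and the use of ordinary least squares are assumptions made for this example.

```python
import numpy as np

def estimate_abundances(depth, deletions):
    """Least-squares estimate of wt and deletion-DVG copy numbers.

    depth: per-position sequencing depth along the genome.
    deletions: list of (start, end) half-open intervals, one per DVG type,
               marking the region that DVG lacks (hypothetical input format).
    Returns an array [wt, dvg_1, dvg_2, ...].
    """
    depth = np.asarray(depth, dtype=float)
    # Column 0 is the wt genome, which covers every position.
    design = np.ones((depth.size, len(deletions) + 1))
    for j, (start, end) in enumerate(deletions, start=1):
        # A deletion DVG contributes no depth inside its deleted region.
        design[start:end, j] = 0.0
    counts, *_ = np.linalg.lstsq(design, depth, rcond=None)
    return counts

# Synthetic profile: 10 wt genomes plus 4 copies of a DVG missing positions 3..6.
profile = np.full(10, 14.0)
profile[3:7] = 10.0
estimated = estimate_abundances(profile, [(3, 7)])
```

On noise-free synthetic data the system is exactly solvable; with real sequencing depth, a constrained or regularized solver (for example, non-negative least squares) would likely be a better fit, but that choice is outside what the abstract specifies.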
Title: Quantifying defective and wild-type viruses from high-throughput RNA sequencing.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11583936/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae632
Katalin Ferenc, Ieva Rauluseviciute, Ladislav Hovan, Vipin Kumar, Marieke L Kuijjer, Anthony Mathelier
Summary: Since high-throughput techniques became a staple in biological science laboratories, computational algorithms and scientific software have boomed. However, the development of bioinformatics software usually lacks software development quality standards. The resulting code is hard to test, reuse, and maintain. We believe that the root of the inefficiency in implementing best software development practices in academic settings is the individualistic approach, which has traditionally been the norm for recognizing scientific achievements and, by extension, for developing specialized software. Software development is a collective effort in most software-heavy endeavors. Indeed, the literature suggests that teamwork directly impacts code quality through knowledge sharing, collective software development, and established coding standards. In our computational biology research groups, we sustainably involve all group members in learning, sharing, and discussing software development while maintaining the personal ownership of research projects and related software products. We found that group members involved in this endeavor improved their coding skills, became more efficient bioinformaticians, and obtained detailed knowledge about their peers' work, triggering new collaborative projects. We strongly advocate for improving software development culture within bioinformatics through collective effort in computational biology groups or institutes with three or more bioinformaticians.
Availability and implementation: Additional information and guidance on how to get started is available at https://ferenckata.github.io/ImprovingSoftwareTogether.github.io/.
Title: Improving bioinformatics software quality through teamwork.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11537420/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae582
Bryan Granger, Stefano Berto
Motivation: The scToppR package provides an interface to ToppGene from R programs and scripts, giving full programmatic access to and control of the database for functional enrichment without manual interaction with its website (https://toppgene.cchmc.org/).
Results: The library facilitates functional enrichment analysis and visualization by interacting with ToppGene, downloading the functional enrichment data frames, and using the R environment to visualize the final results.
Availability and implementation: Code and documentation are currently available at https://github.com/BioinformaticsMUSC/scToppR.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: scToppR: a coding-friendly R interface to ToppGene.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11552619/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae616
Alessandro Lisi, Michael C Campbell
Summary: Admixture is a fundamental process that has shaped levels and patterns of genetic variation in human populations. RFMIX version 2 (RFMIX2) utilizes a robust modeling approach to identify the genetic ancestries in admixed populations. However, this software does not have a built-in method to visually summarize the results of analyses. Here, we introduce the AncestryGrapher toolkit, which converts the numerical output of RFMIX2 into graphical representations of global and local ancestry (i.e. the per-individual ancestry components and the genetic ancestry along chromosomes, respectively).
Results: To demonstrate the utility of our methods, we applied the AncestryGrapher toolkit to visualize the global and local ancestry of individuals in the North African Mozabite Berber population from the Human Genome Diversity Panel. Our results showed that the Mozabite Berbers derived their ancestry from the Middle East, Europe, and sub-Saharan Africa (global ancestry). We also found that the population origin of ancestry varied considerably along chromosomes (local ancestry). For example, we observed variance in local ancestry in the genomic region on Chromosome 2 containing the regulatory sequence in the MCM6 gene associated with lactase persistence, a human trait tied to the cultural development of adult milk consumption. Overall, the AncestryGrapher toolkit facilitates the exploration, interpretation, and reporting of ancestry patterns in human populations.
Availability and implementation: The AncestryGrapher toolkit is free and open source on https://github.com/alisi1989/RFmix2-Pipeline-to-plot.
Title: AncestryGrapher toolkit: Python command-line pipelines to visualize global- and local- ancestry inferences from the RFMIX version 2 software.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11534077/pdf/
Summary: Advanced genomic technologies have generated thousands of protein-nucleic acid binding datasets that have the potential to identify testable gene regulatory network (GRN) models governed by combinatorial associations between factors. Transcription factors (TFs) and RNA-binding proteins (RBPs) are nucleic acid-binding proteins that regulate gene expression and are key drivers of GRN function. However, the combinatorial mechanisms by which the interactions between specific TFs and RBPs regulate gene expression remain largely unknown. Identifying combinations of TFs and RBPs that may function together requires a tool that compares and contrasts the interactions of multiple TFs and RBPs with nucleic acids to find their common and unique targets. Therefore, we introduce BindCompare, a user-friendly tool that can be run locally to predict new combinatorial relationships between TFs and RBPs. BindCompare can analyze data from any organism with an annotated genome and outputs files with detailed genomic locations and gene information for targets for downstream analysis. Overall, BindCompare is a new tool that identifies TFs and RBPs that co-bind to the same DNA and/or RNA loci, generating testable hypotheses about their combinatorial regulation of target genes.
Availability and implementation: BindCompare is an open-source package that is available on the Python Packaging Index (PyPI, https://pypi.org/project/bindcompare/) with the source code available on GitHub (https://github.com/pranavmahabs/bindcompare). Complete documentation for the package can be found at both links.
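To make the notion of "common targets" in the Summary above concrete, here is a minimal sketch of finding regions bound by both a TF and an RBP. It does not use or reproduce the BindCompare API; the function name, the tuple-based input, and the example coordinates are all hypothetical, and sites are assumed to be half-open intervals as in BED files.

```python
from collections import defaultdict

def overlapping_sites(sites_a, sites_b):
    """Return sorted (chrom, start, end) regions covered by a site from each set.

    Each input is a list of (chrom, start, end) tuples with half-open
    coordinates (an assumed, BED-like convention).
    """
    # Index the second set by chromosome so only same-chromosome pairs are compared.
    by_chrom = defaultdict(list)
    for chrom, start, end in sites_b:
        by_chrom[chrom].append((start, end))

    shared = []
    for chrom, a_start, a_end in sites_a:
        for b_start, b_end in by_chrom[chrom]:
            lo, hi = max(a_start, b_start), min(a_end, b_end)
            if lo < hi:  # the two sites share at least one base
                shared.append((chrom, lo, hi))
    return sorted(shared)

# Made-up binding sites for one TF and one RBP.
tf_sites = [("chr2L", 100, 200), ("chr2L", 500, 600), ("chrX", 50, 80)]
rbp_sites = [("chr2L", 150, 250), ("chrX", 90, 120)]
co_bound = overlapping_sites(tf_sites, rbp_sites)
```

The pairwise scan is quadratic per chromosome, which is fine for a toy example; an interval tree or a sorted sweep would be the usual choice for genome-scale peak sets.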
Title: BindCompare: a novel integrated protein-nucleic acid binding analysis platform.
Authors: Pranav Mahableshwarkar, Jasmine Shum, Mukulika Ray, Erica Larschan
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae668
Journal: Bioinformatics (Oxford, England)
Motivation: Automated cell lineage tracing throughout embryogenesis plays a key role in the study of regulatory control of cell fate differentiation, morphogenesis and organogenesis in the development of animals, including the nematode Caenorhabditis elegans. However, automated cell lineage tracing suffers from an exponential increase in errors in late-stage embryos because of the dense distribution of cells, the relatively low signal-to-noise ratio (SNR) and the imbalanced intensity profiles of fluorescence images, which demands a huge amount of human effort to manually correct the errors. Existing image enhancement methods are not sensitive enough to deal with the challenges posed by crowdedness and low SNR. An alternative method is urgently needed to help existing detection methods improve their detection and tracing accuracy, thereby reducing the huge burden of manual curation.
Results: We developed a new method, termed DELICATE, that dramatically improves the accuracy of automated cell lineage tracing, especially beyond the 350-cell stage of the C. elegans embryo. DELICATE works by increasing the local SNR and improving the evenness of nuclear fluorescence intensity across cells, especially in late embryos. The method dramatically reduces both the segmentation errors made by StarryNite and the time required for manually correcting tracing errors up to the 550-cell stage, allowing accurate cell lineages to be generated at large scale with user-friendly software.
Availability and implementation: All images and data are available at https://doi.org/10.6084/m9.figshare.26778475.v1. The code and user-friendly software are available at https://github.com/plcx/NucApp-develop.
Title: Deep learning-based enhancement of fluorescence labeling for accurate cell lineage tracing during embryogenesis.
Authors: Zelin Li, Dongying Xie, Yiming Ma, Cunmin Zhao, Sicheng You, Hong Yan, Zhongying Zhao
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae626
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549013/pdf/
Pub Date: 2024-11-01 DOI: 10.1093/bioinformatics/btae602
Clair S Gutierrez, Alia A Kassim, Benjamin D Gutierrez, Ronald T Raines
Motivation: Post-translational modifications (PTMs) increase the diversity of the proteome and are vital to organismal life and therapeutic strategies. Deep learning has been used to predict PTM locations. Still, limitations in datasets and their analyses compromise success.
Results: We evaluated the use of known PTM sites in prediction via sequence-based deep learning algorithms. For each PTM, known locations of that PTM were encoded as a separate amino acid before sequences were encoded via word embedding and passed into a convolutional neural network that predicts the probability of that PTM at a given site. Without labeling known PTMs, our models are on par with others. With labeling, however, we improved significantly upon extant models. Moreover, knowing PTM locations can increase the predictability of a different PTM. Our findings highlight the importance of PTMs for the installation of additional PTMs. We anticipate that including known PTM locations will enhance the performance of other proteomic machine learning algorithms.
Availability and implementation: Sitetack is available as a web tool at https://sitetack.net; the source code, representative datasets, instructions for local use, and select models are available at https://github.com/clair-gutierrez/sitetack.
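The labeling step described in the Results above, where known PTM locations are encoded as a separate amino acid before embedding, can be sketched as a tokenization routine. This is an illustration only and is not taken from the Sitetack source code: the `@` symbol, the padding character, the window size, and the decision to leave the central residue unmasked are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"        # padding beyond the sequence ends (assumed symbol)
KNOWN_PTM = "@"  # hypothetical extra "amino acid" for a residue with a known PTM
VOCAB = {symbol: index for index, symbol in enumerate(PAD + AMINO_ACIDS + KNOWN_PTM)}

def encode_window(sequence, center, known_sites, flank=3):
    """Integer-encode the window around `center`, marking known PTM sites.

    known_sites: set of 0-based positions already known to carry the PTM.
    The central residue is left unmasked here because it is the site being
    predicted (an assumption of this sketch).
    """
    tokens = []
    for pos in range(center - flank, center + flank + 1):
        if pos < 0 or pos >= len(sequence):
            tokens.append(VOCAB[PAD])
        elif pos in known_sites and pos != center:
            tokens.append(VOCAB[KNOWN_PTM])
        else:
            tokens.append(VOCAB[sequence[pos]])
    return tokens

# Toy peptide: predict at position 4, with a known modification at position 2.
encoded = encode_window("MKSTSPR", center=4, known_sites={2}, flank=3)
```

The resulting integer list is what an embedding layer followed by a convolutional network would consume; the network itself is omitted here.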
Title: Sitetack: a deep learning model that improves PTM prediction by using known PTMs.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11552626/pdf/
Motivation: Knowledge of the specific cell types affected by genetic alterations in rare diseases is crucial for advancing diagnostics and treatments. Despite significant progress, the cell types involved in the majority of rare disease manifestations remain largely unknown. In this study, we integrated scRNA-seq data from non-diseased samples with known genetic disorder genes and phenotypic information to predict the specific cell types disrupted by pathogenic mutations for 482 disease phenotypes.
Results: We found significant phenotype-cell type associations focusing on differential expression and co-expression mechanisms. Our analysis revealed that 13% of the associations documented in the literature were captured through differential expression, while 42% were elucidated through co-expression analysis, also uncovering potential new associations. These findings underscore the critical role of cellular context in disease manifestation and highlight the potential of single-cell data for the development of cell-aware diagnostics and targeted therapies for rare diseases.
Availability and implementation: All code generated in this work is available at https://github.com/SergioAlias/sc-coex.
{"title":"Differential expression and co-expression reveal cell types relevant to genetic disorder phenotypes.","authors":"Sergio Alías-Segura, Florencio Pazos, Monica Chagoyen","doi":"10.1093/bioinformatics/btae646","DOIUrl":"10.1093/bioinformatics/btae646","url":null,"abstract":"<p><strong>Motivation: </strong>Knowledge of the specific cell types affected by genetic alterations in rare diseases is crucial for advancing diagnostics and treatments. Despite significant progress, the cell types involved in the majority of rare disease manifestations remain largely unknown. In this study, we integrated scRNA-seq data from non-diseased samples with known genetic disorder genes and phenotypic information to predict the specific cell types disrupted by pathogenic mutations for 482 disease phenotypes.</p><p><strong>Results: </strong>We found significant phenotype-cell type associations focusing on differential expression and co-expression mechanisms. Our analysis revealed that 13% of the associations documented in the literature were captured through differential expression, while 42% were elucidated through co-expression analysis, also uncovering potential new associations. 
These findings underscore the critical role of cellular context in disease manifestation and highlight the potential of single-cell data for the development of cell-aware diagnostics and targeted therapies for rare diseases.</p><p><strong>Availability and implementation: </strong>All code generated in this work is available at https://github.com/SergioAlias/sc-coex.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549017/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142523810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-01 DOI: 10.1093/bioinformatics/btae650
Sam F L Windels, Daniel Tello Velasco, Mikhail Rotkevich, Noël Malod-Dognin, Nataša Pržulj
Motivation: Spatial Analysis of Functional Enrichment (SAFE) is a popular tool for biologists to investigate the functional organization of biological networks via highly intuitive 2D functional maps. To create these maps, SAFE uses Spring embedding to project a given network into a 2D space in which nodes connected in the network lie near each other. However, many biological networks are scale-free, containing highly connected hub nodes. Because Spring embedding fails to separate hub nodes, it provides uninformative embeddings that resemble a 'hairball'. In addition, Spring embedding only captures direct node connectivity in the network and does not consider higher-order node wiring patterns, which are best captured by graphlets (small, connected, nonisomorphic, induced subgraphs). The scale-free structure of biological networks is hypothesized to stem from an underlying low-dimensional hyperbolic geometry, which novel hyperbolic embedding methods try to uncover. These include coalescent embedding, which projects a network onto a 2D disk.
Results: To better capture the functional organization of scale-free biological networks while going beyond simple direct connectivity patterns, we introduce Graphlet Coalescent (GraCoal) embedding, which places nodes close together on a disk if they frequently co-occur on a given graphlet. We use GraCoal to extend SAFE-based network analysis. Through SAFE-enabled enrichment analysis, we show that GraCoal outperforms graphlet-based Spring embedding in capturing the functional organization of the genetic interaction networks of fruit fly, budding yeast, fission yeast and Escherichia coli. We also show that, depending on the underlying graphlet, GraCoal embeddings capture different topology-function relationships, and that triangle-based GraCoal embedding captures functional redundancies between paralogs.
Availability and implementation: https://gitlab.bsc.es/swindels/gracoal_embedding.
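The idea of nodes co-occurring on a graphlet can be made concrete for the simplest case, the triangle. The sketch below counts, for every connected node pair, how many triangles contain both nodes, which for an edge equals the number of neighbours the two endpoints share. It is a minimal illustration under that assumption and not the GraCoal implementation; the function name and the toy edge list are hypothetical, and the embedding step itself is not shown.

```python
def triangle_cooccurrence(edges):
    """Map each connected node pair to the number of triangles containing both."""
    # Build an undirected adjacency map from the edge list.
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    counts = {}
    for u in neighbours:
        for v in neighbours[u]:
            if u < v:  # visit each undirected edge once
                # Every shared neighbour closes one triangle over the edge (u, v).
                shared = len(neighbours[u] & neighbours[v])
                if shared:
                    counts[(u, v)] = shared
    return counts


# Hypothetical toy network with two triangles (a-b-c and b-c-d) and a pendant node e.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("b", "d"), ("d", "e")]
weights = triangle_cooccurrence(toy_edges)
```

Here the pair (b, c) sits on both triangles and receives weight 2, while the edge (d, e) lies on no triangle and is dropped. A weighted matrix of this kind is the sort of input a graphlet-aware embedding could use in place of the plain adjacency matrix.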
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Graphlet-based hyperbolic embeddings capture evolutionary dynamics in genetic networks.","authors":"Sam F L Windels, Daniel Tello Velasco, Mikhail Rotkevich, Noël Malod-Dognin, Nataša Pržulj","doi":"10.1093/bioinformatics/btae650","DOIUrl":"10.1093/bioinformatics/btae650","url":null,"abstract":"<p><strong>Motivation: </strong>Spatial Analysis of Functional Enrichment (SAFE) is a popular tool for biologists to investigate the functional organization of biological networks via highly intuitive 2D functional maps. To create these maps, SAFE uses Spring embedding to project a given network into a 2D space in which nodes connected in the network are near each other in space. However, many biological networks are scale-free, containing highly connected hub nodes. Because Spring embedding fails to separate hub nodes, it provides uninformative embeddings that resemble a 'hairball'. In addition, Spring embedding only captures direct node connectivity in the network and does not consider higher-order node wiring patterns, which are best captured by graphlets, small, connected, nonisomorphic, induced subgraphs. The scale-free structure of biological networks is hypothesized to stem from an underlying low-dimensional hyperbolic geometry, which novel hyperbolic embedding methods try to uncover. These include coalescent embedding, which projects a network onto a 2D disk.</p><p><strong>Results: </strong>To better capture the functional organization of scale-free biological networks, whilst also going beyond simple direct connectivity patterns, we introduce Graphlet Coalescent (GraCoal) embedding, which embeds nodes nearby on a disk if they frequently co-occur on a given graphlet together. We use GraCoal to extend SAFE-based network analysis. 
Through SAFE-enabled enrichment analysis, we show that GraCoal outperforms graphlet-based Spring embedding in capturing the functional organization of the genetic interaction networks of fruit fly, budding yeast, fission yeast and Escherichia coli. We show that depending on the underlying graphlet, GraCoal embeddings capture different topology-function relationships. We show that triangle-based GraCoal embedding captures functional redundancies between paralogs.</p><p><strong>Availability and implementation: </strong>https://gitlab.bsc.es/swindels/gracoal_embedding.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11568109/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142570444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}