Pub Date: 2024-08-10 | DOI: 10.1101/2024.08.09.607420
Yuqiao Gong, Zhangsheng Yu
Recent advancements in spatially resolved transcriptomics have provided a powerful means to comprehensively capture gene expression patterns while preserving the spatial context of the tissue microenvironment. Accurately deciphering the spatial context of spots within a tissue requires careful use of their spatial information, which in turn requires feature extraction from complex and detailed spatial patterns. In this study, we present RGAST (Relational Graph Attention network for Spatial Transcriptome analysis), a framework designed to learn low-dimensional representations of spatial transcriptome (ST) data. RGAST is the first method in ST analysis to consider gene expression similarity and spatial neighbor relationships simultaneously when constructing a heterogeneous graph network. We further introduce a cross-attention mechanism to provide a more comprehensive and adaptive representation of spatial transcriptome data. We validate the effectiveness of RGAST in different downstream tasks using diverse spatial transcriptomics datasets obtained from different platforms with varying spatial resolutions. Our results demonstrate that RGAST improves spatial domain identification accuracy by approximately 10% over the second-best method on the 10x Visium DLPFC dataset. Furthermore, RGAST facilitates the discovery of spatially variable genes, uncovers spatially resolved cell-cell interactions, enables more precise cell trajectory inference, and reveals intricate 3D spatial patterns across multiple sections of ST data. RGAST is available as a Python package on PyPI at https://pypi.org/project/RGAST, free for academic use, and the source code is openly available from our GitHub repository at https://github.com/GYQ-form/RGAST.
Title: RGAST: Relational Graph Attention Network for Spatial Transcriptome Analysis. Journal: bioRxiv - Bioinformatics. https://doi.org/10.1101/2024.08.09.607420
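The idea of a heterogeneous graph with two relation types (spatial neighbors and expression-similar spots) can be illustrated with a minimal NumPy sketch. This is not the RGAST implementation: the function names `knn_edges` and `build_relational_graph`, the brute-force Euclidean k-nearest-neighbor rule, and the default values of `k_spatial` and `k_expr` are all assumptions made for illustration.

```python
import numpy as np

def knn_edges(features, k):
    """Return directed (source, target) pairs linking each row to its k nearest rows."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a spot is never its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    sources = np.repeat(np.arange(features.shape[0]), k)
    return np.stack([sources, neighbors.ravel()], axis=1)

def build_relational_graph(coords, expression, k_spatial=6, k_expr=6):
    """Hypothetical helper: one edge list per relation type.

    coords:     (n_spots, 2) array of spot positions
    expression: (n_spots, n_features) array, e.g. reduced expression profiles
    """
    return {
        "spatial": knn_edges(coords, k_spatial),        # physically adjacent spots
        "expression": knn_edges(expression, k_expr),    # transcriptionally similar spots
    }

# Toy data: four spots on a line, with expression that pairs up distant spots.
coords = np.array([[0.0], [1.0], [3.0], [10.0]])
expression = np.array([[0.0], [5.0], [0.1], [5.2]])
graph = build_relational_graph(coords, expression, k_spatial=1, k_expr=1)
```

In the toy data the two relations disagree on purpose: spots 0 and 2 are not spatial nearest neighbors but are linked by expression similarity, which is the kind of complementary signal a relational graph network can attend over.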
Pub Date: 2024-08-10 | DOI: 10.1101/2024.08.09.607371
Xue Zou, Zachary W. Gomez, Timothy E. Reddy, Andrew S. Allen, William H. Majoros
Motivation: Allele-specific expression (ASE) analyses aim to detect imbalanced expression of maternal versus paternal copies of an autosomal gene. Such allelic imbalance can result from a variety of cis-acting causes, including disruptive mutations within one copy of a gene that impact the stability of transcripts, as well as regulatory variants outside the gene that impact transcription initiation. Current methods for ASE estimation suffer from a number of shortcomings, such as relying on only one variant within a gene, assuming perfect phasing information across multiple variants within a gene, or failing to account for alignment biases and possible genotyping errors. Results: We developed BEASTIE, a Bayesian hierarchical model designed for precise ASE quantification at the gene level, based on given genotypes and RNA-seq data. BEASTIE addresses allelic mapping bias, genotyping error, and phasing error, the last by incorporating empirical phasing error rates derived from Genome-in-a-Bottle individual NA12878. BEASTIE surpasses existing methods in accuracy, especially in scenarios with high phasing error rates. This improvement is critical for identifying rare genetic variants often obscured by such errors. Through rigorous validation on simulated data and application to real data from the 1000 Genomes Project, we establish the robustness of BEASTIE. These findings underscore the value of BEASTIE in revealing patterns of ASE across gene sets and pathways.
Title: Bayesian Estimation of Allele-Specific Expression in the Presence of Phasing Uncertainty. Journal: bioRxiv - Bioinformatics. https://doi.org/10.1101/2024.08.09.607371
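To see why phasing errors matter for gene-level ASE, consider a naive baseline that pools reads across heterozygous sites according to an assumed phase and applies an exact binomial test. This is not the BEASTIE model (which is Bayesian and hierarchical); the function names, the read counts, and the two-site gene below are hypothetical and serve only to show how a single phase flip can mask a real imbalance.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))

def pooled_haplotype_counts(sites, phase):
    """Pool reads per haplotype.

    sites: list of (ref_reads, alt_reads) per heterozygous site
    phase: list of bool, True if the alt allele is assumed to lie on haplotype 1
    """
    hap1 = sum(alt if on_h1 else ref for (ref, alt), on_h1 in zip(sites, phase))
    hap2 = sum(ref if on_h1 else alt for (ref, alt), on_h1 in zip(sites, phase))
    return hap1, hap2

# Hypothetical gene with two heterozygous sites, both favoring the same haplotype.
sites = [(30, 10), (12, 28)]
correct = pooled_haplotype_counts(sites, [False, True])   # true phase
flipped = pooled_haplotype_counts(sites, [False, False])  # second site mis-phased

p_correct = binomial_two_sided_p(correct[0], sum(correct))
p_flipped = binomial_two_sided_p(flipped[0], sum(flipped))
```

With the correct phase the pooled counts are strongly imbalanced, while the mis-phased pooling makes the two sites cancel each other out; a method that treats phase as uncertain avoids committing to either pooling.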
Pub Date: 2024-08-10 | DOI: 10.1101/2024.06.17.599219
Yue Hu, Svitlana Oleshko, Samuele Firmani, Zhaocheng Zhu, Hui Cheng, Maria Ulmer, Matthias Arnold, Maria Colome-Tatche, Jian Tang, Sophie Xhonneux, Annalisa Marsico
Understanding complex interactions in biomedical networks is crucial for advancements in biomedicine, but traditional link prediction (LP) methods are limited in capturing this complexity. Representation-based learning techniques improve prediction accuracy by mapping nodes to low-dimensional embeddings, yet they often struggle with interpretability and scalability. We present BioPathNet, a novel graph neural network framework based on the Neural Bellman-Ford Network (NBFNet), addressing these limitations through path-based reasoning for LP in biomedical knowledge graphs. Unlike node-embedding frameworks, BioPathNet learns representations between node pairs by considering all relations along paths, enhancing prediction accuracy and interpretability. This allows visualization of influential paths and facilitates biological validation. BioPathNet leverages a background regulatory graph (BRG) for enhanced message passing and uses stringent negative sampling to improve precision. In evaluations across various LP tasks, such as gene function annotation, drug-disease indication, synthetic lethality, and lncRNA-mRNA interaction prediction, BioPathNet consistently outperformed shallow node embedding methods, relational graph neural networks and task-specific state-of-the-art methods, demonstrating robust performance and versatility. Our study predicts novel drug indications for diseases like acute lymphoblastic leukemia (ALL) and Alzheimer's, validated by medical experts and clinical trials. We also identified new synthetic lethality gene pairs and regulatory interactions involving lncRNAs and target genes, confirmed through literature reviews. BioPathNet's interpretability will enable researchers to trace prediction paths and gain molecular insights, making it a valuable tool for drug discovery, personalized medicine and biology in general.
Title: Path-based reasoning in biomedical knowledge graphs with BioPathNet. Journal: bioRxiv - Bioinformatics. https://doi.org/10.1101/2024.06.17.599219
Pub Date: 2024-08-10 | DOI: 10.1101/2024.08.09.607400
Brian Gural, Logan Kirkland, Abbey Hockett, Peyton Sandroni, Jiandong Zhang, Manuel Rosa-Garrido, Samantha K Swift, Douglas Chapski, Michael A Flinn, Caitlin C O'Meara, Thomas M Vondriska, Michaela Patterson, Brian C Jensen, Christoph Rau
Background: Recent advances in single-cell sequencing have led to an increased focus on the role of cell-type composition in phenotypic presentation and disease progression. Cell-type composition research in the heart is challenging because large, frequently multinucleated cardiomyocytes preclude most single-cell approaches from obtaining accurate measurements of cell composition. Our in silico studies reveal that ignoring cell-type composition when calculating differentially expressed genes (DEGs) can have significant consequences. For example, a relatively small change in cell abundance of only 10% can result in over 25% of DEGs being false positives. Methods: We have implemented an algorithmic approach that uses snRNAseq datasets as a reference to accurately calculate cell-type compositions from bulk RNAseq datasets through robust data cleaning, gene selection, and multi-sample cross-subject and cross-cell-type deconvolution. We applied our approach to cardiomyocyte-specific α1A adrenergic receptor (CM-α1A-AR) knockout mice. Mice aged 8-12 weeks (either WT or CM-α1A-KO) were subjected to permanent left coronary artery (LCA) ligation or sham surgery (n=4 per group). Transcriptomes from the infarct border zones were collected 3 days later and analyzed using our algorithm to determine cell-type abundances. We then corrected the differential expression calculations in DESeq2 and validated the findings using RNAscope. Results: Uncorrected DEGs for the CM-α1A-KO X LCA interaction term featured many cell-type-specific genes such as Timp4 (fibroblasts) and Aplnr (cardiomyocytes) and overall GO enrichment for terms pertaining to cardiomyocyte differentiation (P=3.1E-4). Using our algorithm, we observe a striking loss of cardiomyocytes and gain in fibroblasts in the α1A-KO + LCA mice that was not recapitulated in WT + LCA animals, although we did observe a similar increase in macrophage abundance in both conditions. This recapitulates prior results that showed a much more severe heart failure phenotype in CM-α1A-KO + LCA mice. Following correction for cell-type composition, our DEGs now highlight a novel set of genes enriched for GO terms such as cardiac contraction (P=3.7E-5) and actin filament organization (P=6.3E-5). Conclusions: Our algorithm identifies and corrects for cell-type abundance in bulk RNAseq datasets, opening new avenues for research on novel genes and pathways as well as an improved understanding of the role of cardiac cell types in cardiovascular disease.
Title: Novel Insights into Post-Myocardial Infarction Cardiac Remodeling through Algorithmic Detection of Cell-Type Composition Shifts. Journal: bioRxiv - Bioinformatics. https://doi.org/10.1101/2024.08.09.607400
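The confounding effect of composition shifts, and the basic idea of reference-based deconvolution, can be sketched with a linear mixing model in which bulk expression is the signature matrix times the cell-type proportions. This is a simplified illustration, not the authors' algorithm: the signature values, cell-type order, and proportions are invented, and plain least squares with clipping stands in for whatever constrained estimator a real pipeline would use.

```python
import numpy as np

# Hypothetical signature matrix: rows are genes, columns are cell types
# (cardiomyocyte, fibroblast, macrophage). Values are made up for illustration.
signature = np.array([
    [100.0,  5.0,  2.0],   # cardiomyocyte marker
    [  4.0, 80.0,  3.0],   # fibroblast marker
    [  1.0,  6.0, 60.0],   # macrophage marker
    [ 50.0, 50.0, 50.0],   # gene expressed equally in every cell type
])

def mix(signature, proportions):
    """Bulk expression under a linear mixing model: signature @ proportions."""
    return signature @ np.asarray(proportions, dtype=float)

def estimate_proportions(signature, bulk):
    """Least-squares estimate, clipped at zero and renormalized to sum to one."""
    raw, *_ = np.linalg.lstsq(signature, bulk, rcond=None)
    clipped = np.clip(raw, 0.0, None)
    return clipped / clipped.sum()

# A 10-point shift from cardiomyocytes to fibroblasts, with no per-cell expression change.
sham = mix(signature, [0.6, 0.3, 0.1])
injured = mix(signature, [0.5, 0.4, 0.1])

estimated = estimate_proportions(signature, injured)
```

Here the cardiomyocyte marker drops and the fibroblast marker rises in bulk purely because the mixture changed, which an uncorrected differential expression test would report as regulation. Including the estimated proportions as covariates is one common way to account for this.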
Pub Date: 2024-08-10 | DOI: 10.1101/2024.08.09.607368
Xin Dai, Max Henderson, Shinjae Yoo, Qun Liu
Metals are essential elements in all living organisms, binding to approximately 50% of proteins. They serve to stabilize proteins, catalyze reactions, regulate activities, and fulfill various physiological and pathological functions. While there have been many advancements in determining the structures of protein-metal complexes, numerous metal-binding proteins still need to be identified through computational methods and validated through experiments. To address this need, we have developed the ESMBind-based workflow, which combines evolutionary scale modeling (ESM) for metal-binding prediction and physics-based protein-metal modeling. Our approach utilizes the ESM-2 and ESM-IF models to predict metal-binding probability at the residue level. In addition, we have designed a metal-placement method and energy minimization technique to generate detailed 3D structures of protein-metal complexes. Our workflow outperforms other models in terms of residue and 3D-level predictions. To demonstrate its effectiveness, we applied the workflow to 142 uncharacterized fungal pathogen proteins and predicted metal-binding proteins involved in fungal infection and virulence.
Title: Predict metal-binding proteins and structures through integration of evolutionary-scale and physics-based modeling. Journal: bioRxiv - Bioinformatics. https://doi.org/10.1101/2024.08.09.607368
Pub Date: 2024-08-10 | DOI: 10.1101/2024.08.10.607395
Robinson Vidva, Mir Abbas Raza, Jaswant Prabhakaran, Ayesha Sheikh, Alaina Sharp, Hayden Ott, Amelia Moore, Christopher Fleisher, Pothitos M. Pitychoutis, Tam V. Nguyen, Aaron Sathyanesan
Research-animal colony management is a critical determinant of scientific productivity in labs conducting preclinical research. However, labs generally use ad hoc paper-based or spreadsheet-based methods to manage animal colonies. Current software-based solutions are limited by cost, difficulty of implementation and deployment, and lack of remote access. In addition, most current solutions do not integrate realtime monitoring of ambient variables that affect colony wellbeing and breeding efficiency. Keeping these functionalities in mind, we built MyVivarium, a cloud-based animal colony database management web application that can be easily deployed and sustained for a cost comparable to starting and maintaining a lab website. For mouse colony management, MyVivarium allows multiple users, at designated levels of access, to track individual mice or litters within holding-cage and breeding-cage records. Physical identities of cages are mapped onto the database using printable cage-card templates with cage-specific QR codes, enabling quick record updating using mobile devices. Tasks can be easily assigned with reminders for experiments or cage maintenance. Finally, MyVivarium integrates near-realtime internet-of-things (IoT)-based ambient sensing using low-cost open-source hardware to track humidity, temperature, vivarium-worker activity, and room illuminance. Taken together, MyVivarium is a novel open-source cloud-based application template that can serve as a low-cost, simple, and efficient solution for digital management of research animal colonies.
Title: MyVivarium: A cloud-based lab animal colony management application with near-realtime ambient sensing. Journal: bioRxiv - Bioinformatics. https://doi.org/10.1101/2024.08.10.607395
Human Papillomavirus 18 (HPV 18) is a high-risk type associated with cervical and anogenital malignancies. The high-risk types HPV 18 and HPV 16 (human papillomavirus 16) play a major part in about 70 percent of cervical cancers worldwide (Ramakrishnan et al., 2015). The L1 protein of HPV 18, also known as the major capsid protein L1, is targeted in vaccine development against HPV 18 because it is non-oncogenic and non-infectious and can self-assemble into virus-like particles. In the present analysis, we conducted an extensive analysis of codon usage bias in the HPV 18 L1 protein and of its adaptation to its human host. The effective number of codons (Nc), grand average of hydropathy (GRAVY), index of aromaticity (AROMO), and codon bias index (CBI) values revealed no bias in codon usage of the HPV 18 L1 protein. The Codon Adaptation Index (CAI) and Relative Codon Deoptimization Index (RCDI) indicate that the HPV 18 L1 protein is adapted to its human host. Selection pressure was shown to dominate codon usage of the HPV 18 L1 protein, based on GC12 vs GC3, Nc vs GC3, and the frequency of optimal codons (FOP). The parity plot revealed that the HPV 18 L1 coding sequence prefers purine over pyrimidine for G versus C nucleotides and shows no preference for A over T, although overall A/T richness was observed. In the nucleotide composition, GC1 richness reflects evolutionary aspects of codon usage. These findings can be used in ongoing vaccine development and in gene therapy to design viral vectors.
Title: Codon Usage Bias Analysis of Human Papillomavirus 18s L1 Protein and its Host Adaptability. Authors: Vinaya Vinod Shinde, Swati Bankariya, Parminder Kaur. Pub Date: 2024-08-10. Journal: bioRxiv - Bioinformatics. https://doi.org/10.1101/2024.08.10.607454
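The position-specific GC measures named in the abstract (GC1, GC3, and GC12, the mean of GC1 and GC2) can be computed directly from a coding sequence. The sketch below is a minimal version under stated assumptions: the function name `gc_by_codon_position` is hypothetical, the example sequence is a made-up four-codon toy rather than the HPV 18 L1 gene, and all codons are counted, whereas some studies exclude Met, Trp, and stop codons when computing GC3.

```python
def gc_by_codon_position(cds):
    """Return GC fractions at codon positions 1, 2, 3 and the GC12 average.

    cds: in-frame coding sequence; trailing bases that do not form a codon are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence must contain at least one complete codon")
    counts = [0, 0, 0]
    for i in range(usable):
        if seq[i] in "GC":
            counts[i % 3] += 1  # i % 3 is the position within the codon (0-based)
    gc1, gc2, gc3 = (c / n_codons for c in counts)
    return {"GC1": gc1, "GC2": gc2, "GC3": gc3, "GC12": (gc1 + gc2) / 2}

# Toy sequence with codons ATG GCC AAA TGA.
result = gc_by_codon_position("ATGGCCAAATGA")
```

In a neutrality analysis, GC12 is plotted against GC3 across genes; a regression slope near zero is usually read as selection dominating over mutation pressure, which is the kind of evidence the abstract refers to.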
Pub Date: 2024-08-10 | DOI: 10.1101/2024.08.10.607413
Imad Abugessaisa, Akira Hasegawa, Scott Walker, Shintaro Katayama, Juha Kere, Takeya Kasukawa
For droplet-based Chromium single-cell gene expression data from the 10x Genomics platform, Cell Ranger (CR) is the standard pipeline for cell barcode calling. However, no systematic evaluation of the impact of the released versions of CR on Chromium single-cell gene expression data has been conducted. To comprehensively evaluate the impact of CR, we considered six molecular quality criteria, quantified gene expression, and performed downstream analysis for 12 single-cell Chromium gene expression datasets. Each dataset was processed by 10 versions of CR resulting in 180 datasets and a total of 702,493 cell barcodes. We demonstrated that different versions of CR yield different numbers of cell barcodes, with significant variation in molecular quality and average gene expression for the same dataset. Our analysis distinguishes two categories of cell barcodes: common barcodes called (unmasked) by all versions of CR, and specific barcodes only called (unmasked/masked) by some versions. Surprisingly, we observed variations in molecular quality indices between common cell barcodes when called by different versions of CR. The specific barcodes yield skewed gene body coverage and form distinct clusters at the edges of UMAP plots. The choice of CR version affects quality scores, average gene expression, clustering results, and top cluster marker genes for each dataset. Our study indicates a demonstrable, quantitative effect of the choice of CR version on downstream analysis, resulting in widely different Chromium single-cell gene expression data for different CR versions.
{"title":"Impacts of Cell Ranger versions on Chromium gene expression data","authors":"Imad Abugessaisa, Akira Hasegawa, Scott Walker, Shintaro Katayama, Juha Kere, Takeya Kasukawa","doi":"10.1101/2024.08.10.607413","DOIUrl":"https://doi.org/10.1101/2024.08.10.607413","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-08-10"}
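The split between common and specific barcodes described in the Cell Ranger abstract above amounts to a set intersection and a set difference across versions. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `partition_barcodes`, the version labels, and the toy barcode strings are all hypothetical, and the sketch ignores the masked/unmasked distinction.

```python
def partition_barcodes(calls_by_version):
    """Split barcodes into 'common' (called by every version) and
    'specific' (called by at least one version, but not all).

    calls_by_version: dict mapping a version label to an iterable of barcodes.
    """
    barcode_sets = [set(barcodes) for barcodes in calls_by_version.values()]
    if not barcode_sets:
        return set(), set()
    called_by_any = set().union(*barcode_sets)   # union over all versions
    common = set.intersection(*barcode_sets)     # present in every version
    specific = called_by_any - common            # present in only some versions
    return common, specific


# Hypothetical toy input: labels and barcodes are illustrative only.
calls = {
    "version_a": ["AAAC", "AAAG", "AACT"],
    "version_b": ["AAAC", "AAAG", "AAGT"],
    "version_c": ["AAAC", "AAAG", "AACT", "AAGT"],
}
common, specific = partition_barcodes(calls)
print(sorted(common))    # ['AAAC', 'AAAG']
print(sorted(specific))  # ['AACT', 'AAGT']
```

In a real comparison the per-version barcode lists would come from each run's filtered barcode output, and any per-barcode quality metric could then be compared within the common set across versions.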
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.10.607430
Marco Nicolini, Emanuele Saitto, Ruben E Jimenez Franco, Emanuele Cavalleri, Marco Mesiti, Aldo J Galeano Alfonso, Dario Malchiodi, Alberto Paccanaro, Peter N Robinson, Elena Casiraghi, Giorgio Valentini
We introduce Finenzyme, a Protein Language Model (PLM) that employs a multifaceted learning strategy based on transfer learning from a decoder-based Transformer, conditional learning using specific functional keywords, and fine-tuning to model specific Enzyme Commission (EC) categories. Using Finenzyme, we investigate the conditions under which fine-tuning enhances the prediction and generation of EC categories, showing a two-fold perplexity improvement in EC-specific categories compared to a generalist model. Our extensive experimentation shows that Finenzyme-generated sequences can be very different from natural ones while retaining tertiary structures, functions and chemical kinetics similar to those of their natural counterparts. Importantly, the embedded representations of the generated enzymes closely resemble those of natural ones, making them suitable for downstream tasks. Finally, we illustrate how Finenzyme can be used in practice to generate enzymes characterized by specific functions using in-silico directed evolution, a computationally inexpensive PLM fine-tuning procedure that significantly enhances and assists targeted enzyme engineering tasks.
{"title":"Fine-tuning of conditional Transformers for the generation of functionally characterized enzymes","authors":"Marco Nicolini, Emanuele Saitto, Ruben E Jimenez Franco, Emanuele Cavalleri, Marco Mesiti, Aldo J Galeano Alfonso, Dario Malchiodi, Alberto Paccanaro, Peter N Robinson, Elena Casiraghi, Giorgio Valentini","doi":"10.1101/2024.08.10.607430","DOIUrl":"https://doi.org/10.1101/2024.08.10.607430","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-08-10"}
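For readers unfamiliar with the perplexity metric mentioned in the Finenzyme abstract above, the sketch below shows the standard definition: the exponential of the mean negative log-likelihood per token. It is an illustration only. The function name `perplexity` and the per-token probabilities are made up for the example and are not taken from the paper; they are chosen so that the ratio comes out at two.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-residue probabilities assigned to the same sequence
# by two models; the values are illustrative, not measured.
generalist = [math.log(0.05)] * 4   # each residue predicted with p = 0.05
fine_tuned = [math.log(0.10)] * 4   # each residue predicted with p = 0.10

ratio = perplexity(generalist) / perplexity(fine_tuned)
print(round(ratio, 6))  # approximately 2, i.e. a two-fold lower perplexity
```

Lower perplexity means the model assigns higher probability to the observed sequence, so a two-fold improvement corresponds to halving this value on the evaluated sequences.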
Pub Date : 2024-08-10 DOI: 10.1101/2024.08.10.606103
Yifan Li, Xiaoyong Pan, Hong-Bin Shen
Nuclear localization signals (NLSs) are essential peptide fragments within proteins that play a decisive role in guiding proteins into the cell nucleus. Determining the existence and precise locations of NLSs experimentally is time-consuming and complicated, resulting in a scarcity of experimentally validated NLS fragments. Consequently, annotated NLS datasets are relatively small, presenting challenges for data-driven methods. In this study, we propose an interpretable approach, NLSExplorer, which combines large-scale protein language models, used to capture crucial biological information, with a novel attention-based deep network for NLS identification. By utilizing the knowledge retrieved from protein language models, NLSExplorer achieves superior predictive performance compared to existing methods on two NLS benchmark datasets. Additionally, NLSExplorer is able to detect various kinds of segments highly correlated with nuclear transport, such as nuclear export signals. We employ NLSExplorer to investigate potential NLSs and other domains that are important for nuclear transport in nucleus-localized proteins within the Swiss-Prot database. Further comprehensive pattern analysis of all these segments uncovers a potential NLS space and the internal relationships among important nuclear transport segments for 416 species. This study not only introduces a powerful tool for predicting and exploring NLS space, but also offers a versatile network that detects characteristic domains and motifs of NLSs.
{"title":"Discovering nuclear localization signal universe through a novel deep learning model with interpretable attention units","authors":"Yifan Li, Xiaoyong Pan, Hong-Bin Shen","doi":"10.1101/2024.08.10.606103","DOIUrl":"https://doi.org/10.1101/2024.08.10.606103","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-08-10"}