Pub Date: 2024-08-29; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae118
Davide Bressan, Daniel Fernández-Pérez, Alessandro Romanel, Fulvio Chiacchiera
ChIP with reference exogenous genome (ChIP-Rx) is widely used to study histone modification changes across different biological conditions. A key step in the bioinformatics analysis of these data is calculating the normalization factors, which differ from those used in standard ChIP-seq pipelines. Choosing and applying the appropriate normalization method is crucial for interpreting the biological results. However, a comprehensive pipeline for complete ChIP-Rx data analysis is lacking. To address these challenges, we introduce SpikeFlow, an integrated Snakemake workflow that combines features from various existing tools to streamline ChIP-Rx data processing and enhance usability. SpikeFlow automates spike-in data scaling and provides multiple normalization options. It also performs peak calling and differential analysis with distinct modalities, enabling the detection of enrichment regions for histone modifications and transcription factor binding. Our workflow runs in-depth quality control at all the processing steps and generates an analysis report with tables and graphs to facilitate interpretation of the results. We validated the pipeline by performing a comparative analysis with DiffBind and SpikChIP, demonstrating robust performance in various biological models. By combining diverse functionalities into a single platform, SpikeFlow aims to simplify ChIP-Rx data analysis for the research community.
Title: SpikeFlow: automated and flexible analysis of ChIP-Seq data with spike-in control. Journal: NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae118. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358820/pdf/
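As a rough illustration of the spike-in scaling step mentioned in the abstract, the sketch below computes the reference-adjusted factor commonly used for ChIP-Rx data, alpha = 1 / (reads mapped to the exogenous genome, in millions). It is only one of several possible normalizations and is not taken from the SpikeFlow code; the function names and the read counts are hypothetical.

```python
def rx_scale_factor(spikein_reads: int) -> float:
    """Return alpha = 1 / (spike-in reads, in millions)."""
    if spikein_reads <= 0:
        raise ValueError("spike-in read count must be positive")
    return 1.0 / (spikein_reads / 1_000_000)


def scale_coverage(coverage, alpha):
    """Multiply every coverage value of a sample by its scaling factor."""
    return [value * alpha for value in coverage]


# Hypothetical numbers of reads aligned to the spike-in genome per sample.
spikein_counts = {"control": 2_000_000, "treated": 4_000_000}
factors = {name: rx_scale_factor(n) for name, n in spikein_counts.items()}

# The same raw signal ends up lower in the sample with more spike-in reads.
scaled_control = scale_coverage([10.0, 20.0], factors["control"])
scaled_treated = scale_coverage([10.0, 20.0], factors["treated"])
```

A sample that yields more spike-in reads receives a smaller factor, which is how a global loss or gain of the histone mark becomes visible after scaling.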
Pub Date: 2024-08-29; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae111
Mikhail Gudkov, Loïc Thibaut, Eleni Giannoulatou
Interpretation of genetic variants remains challenging, partly due to the lack of well-established ways of determining the potential pathogenicity of genetic variation, especially for understudied classes of variants. To address this, population genetics methods offer a practical solution by evaluating variant effects through human population distributions. Negative selection influences the ratio of singleton variants and can serve as a proxy for deleteriousness, as exemplified by the Mutability-Adjusted Proportion of Singletons (MAPS) metric. However, MAPS is sensitive to the calibration of the singletons-by-mutability linear model, which results in biased estimates for certain variant classes. Building on the methodology used in MAPS, we introduce the Context-Adjusted Proportion of Singletons (CAPS) metric for assessing negative selection in the human genome. CAPS produces corrected estimates with more accurate confidence intervals by eliminating the mutability layer in the model. Retaining the advantageous features of MAPS, CAPS emerges as a robust and reliable tool. We believe that CAPS has the potential to enhance the identification of new disease-variant associations in clinical and research settings, offering improved accuracy in assessing negative selection for diverse SNV classes.
Title: Context-adjusted proportion of singletons (CAPS): a novel metric for assessing negative selection in the human genome. Journal: NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae111. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358819/pdf/
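The idea of comparing an observed proportion of singletons with a context-based expectation can be sketched as follows. This is a simplified reading of the abstract, not the published CAPS estimator: it assumes the expected singleton proportion for each sequence context is taken directly from a reference class of variants (for example synonymous ones), with no mutability regression in between. The function names, the context labels and the toy data are all hypothetical, and confidence intervals are left out.

```python
from collections import Counter


def singleton_proportions(variants):
    """Per-context proportion of singletons in (context, is_singleton) pairs."""
    totals, singletons = Counter(), Counter()
    for context, is_singleton in variants:
        totals[context] += 1
        singletons[context] += int(is_singleton)
    return {ctx: singletons[ctx] / totals[ctx] for ctx in totals}


def context_adjusted_excess(test_variants, reference_variants):
    """(observed - expected singletons) / number of test variants.

    The expectation sums the reference singleton proportion of each test
    variant's context; a context missing from the reference raises KeyError.
    """
    reference = singleton_proportions(reference_variants)
    observed = sum(int(is_singleton) for _, is_singleton in test_variants)
    expected = sum(reference[context] for context, _ in test_variants)
    return (observed - expected) / len(test_variants)


# Toy reference class: half of ACG variants and a quarter of TTA are singletons.
reference_class = [("ACG", True), ("ACG", False),
                   ("TTA", True), ("TTA", False), ("TTA", False), ("TTA", False)]
# Toy test class with an excess of singletons.
test_class = [("ACG", True), ("ACG", True), ("TTA", True), ("TTA", False)]

excess = context_adjusted_excess(test_class, reference_class)
```

A positive value means the test class carries more singletons than its sequence contexts alone would predict, which is the signal interpreted as negative selection.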
Pub Date: 2024-08-29; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae115
Reynold Yu, Huijing Xue, Wanru Lin, Francis S Collins, Stephen M Mount, Kan Cao
Hutchinson-Gilford Progeria Syndrome (HGPS) is a premature aging disease caused primarily by a C1824T mutation in LMNA. This mutation activates a cryptic splice donor site, producing a lamin variant called progerin. Interestingly, progerin has also been detected in cells and tissues of non-HGPS patients. Here, we investigated progerin expression using publicly available RNA-seq data from non-HGPS patients in the GTEx project. We found that progerin expression is present across all tissue types in non-HGPS patients and correlated with telomere shortening in the skin. Transcriptome-wide correlation analyses suggest that the level of progerin expression is correlated with switches in gene isoform expression patterns. Differential expression analyses show that progerin expression is correlated with significant changes in genes involved in splicing regulation and mitochondrial function. Interestingly, 5' splice sites whose use is correlated with progerin expression have significantly altered frequencies of consensus trinucleotides within the core 5' splice site. Furthermore, introns whose alternative splicing correlates with progerin have reduced GC content. Our study suggests that progerin expression in non-HGPS patients is part of a global shift in splicing patterns.
Title: Progerin mRNA expression in non-HGPS patients is correlated with widespread shifts in transcript isoforms. Journal: NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae115. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358823/pdf/
In eukaryotes, genes produce a variety of distinct RNA isoforms, each with potentially unique protein products, coding potential or regulatory signals such as poly(A) tail and nucleotide modifications. Assessing the kinetics of RNA isoform metabolism, such as transcription and decay rates, is essential for unraveling gene regulation. However, it is currently impeded by lack of methods that can differentiate between individual isoforms. Here, we introduce RNAkinet, a deep convolutional and recurrent neural network, to detect nascent RNA molecules following metabolic labeling with the nucleoside analog 5-ethynyl uridine and long-read, direct RNA sequencing with nanopores. RNAkinet processes electrical signals from nanopore sequencing directly and distinguishes nascent from pre-existing RNA molecules. Our results show that RNAkinet prediction performance generalizes in various cell types and organisms and can be used to quantify RNA isoform half-lives. RNAkinet is expected to enable the identification of the kinetic parameters of RNA isoforms and to facilitate studies of RNA metabolism and the regulatory elements that influence it.
Pub Date: 2024-08-29; DOI: 10.1093/nargab/lqae116
Vlastimil Martinek, Jessica Martin, Cedric Belair, Matthew J Payea, Sulochan Malla, Panagiotis Alexiou, Manolis Maragkakis
Title: Deep learning and direct sequencing of labeled RNA captures transcriptome dynamics. Journal: NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae116. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358824/pdf/
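To make the link between nascent-read classification and isoform half-lives concrete, the sketch below uses the textbook steady-state relation for metabolic labeling: if a fraction f of an isoform's molecules is newly made after labeling for time t, then f = 1 - exp(-k t), so the decay rate is k = -ln(1 - f) / t and the half-life is ln(2) / k. This assumes first-order decay, steady-state expression and unbiased nascent calls; it is not necessarily the estimator used by RNAkinet, and the function names are hypothetical.

```python
import math


def decay_rate(nascent_fraction: float, labeling_time: float) -> float:
    """Decay rate k from the fraction of nascent molecules after labeling."""
    if not 0.0 < nascent_fraction < 1.0:
        raise ValueError("nascent fraction must lie strictly between 0 and 1")
    if labeling_time <= 0:
        raise ValueError("labeling time must be positive")
    return -math.log(1.0 - nascent_fraction) / labeling_time


def half_life(nascent_fraction: float, labeling_time: float) -> float:
    """Half-life in the same time unit as labeling_time."""
    return math.log(2.0) / decay_rate(nascent_fraction, labeling_time)


# An isoform with half of its reads called nascent after 2 h of labeling.
slow_isoform = half_life(0.5, 2.0)
# An isoform with three quarters of its reads called nascent in the same time.
fast_isoform = half_life(0.75, 2.0)
```

In practice the nascent fraction would be computed per isoform from the reads assigned to it, so isoforms of the same gene can receive different half-lives.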
Pub Date: 2024-08-16; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae091
Carlos Camilleri-Robles, Raziel Amador, Marcel Tiebe, Aurelio A Teleman, Florenci Serras, Roderic Guigó, Montserrat Corominas
The discovery of functional long non-coding RNAs (lncRNAs) overturned the initial view of them as transcriptional noise. LncRNAs have been identified as regulators of multiple biological processes, including chromatin structure, gene expression, splicing, mRNA degradation, and translation. However, functional studies of lncRNAs are hindered by the usual lack of phenotypes upon deletion or inhibition. Here, we used Drosophila imaginal discs as a model system to identify lncRNAs involved in development and regeneration. We examined a subset of lncRNAs expressed during wing, leg, and eye disc development. Additionally, we analyzed transcriptomic data from regenerating wing discs to profile the expression pattern of lncRNAs during tissue repair. We focused on the lncRNA CR40469, which is upregulated during regeneration. We generated CR40469 mutant flies that developed normally but showed impaired wing regeneration upon cell death induction. The ability of these mutants to regenerate was restored by the ectopic expression of CR40469. Furthermore, we found that the lncRNA CR34335 has a high degree of sequence similarity with CR40469 and can partially compensate for its function during regeneration in the absence of CR40469. Our findings point to a potential role of the lncRNA CR40469 in trans during the response to damage in the wing imaginal disc.
Title: Long non-coding RNAs involved in Drosophila development and regeneration. Journal: NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae091. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11327875/pdf/
Pub Date: 2024-08-10; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae086
Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Nathan J LeRoy, Aidong Zhang, Nathan C Sheffield
Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
Title: Methods for evaluating unsupervised vector representations of genomic regions. Journal: NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae086. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11316252/pdf/
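One way to picture a neighborhood-preserving measure is to ask, for each region, how many of its nearest neighbors along the genome are also among its nearest neighbors in embedding space. The sketch below is an illustrative reading of that idea and not the published NPS definition: it assumes all regions lie on one chromosome, represents each region by a single midpoint coordinate, and uses Euclidean distance between embeddings. All names and the toy data are hypothetical.

```python
import math


def nearest_neighbors(index, distance, n_items, k):
    """Indices of the k items closest to `index` (ties broken by index)."""
    others = [j for j in range(n_items) if j != index]
    others.sort(key=lambda j: (distance(index, j), j))
    return set(others[:k])


def neighborhood_overlap(positions, embeddings, k):
    """Mean fraction of genomic k-nearest neighbors kept in embedding space."""
    n = len(positions)

    def genomic(i, j):
        return abs(positions[i] - positions[j])

    def embedded(i, j):
        return math.dist(embeddings[i], embeddings[j])

    total = 0.0
    for i in range(n):
        shared = nearest_neighbors(i, genomic, n, k) & nearest_neighbors(i, embedded, n, k)
        total += len(shared) / k
    return total / n


# Four regions forming two genomic pairs (midpoints in base pairs).
midpoints = [100, 200, 5000, 5100]
# Embeddings that keep each genomic pair together.
faithful = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
# Embeddings that pair each region with a genomically distant one.
scrambled = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]

score_faithful = neighborhood_overlap(midpoints, faithful, k=1)
score_scrambled = neighborhood_overlap(midpoints, scrambled, k=1)
```

Comparing the score of trained embeddings with that of randomly shuffled embeddings would give a baseline for how much genomic neighborhood structure the model has actually captured.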
Pub Date: 2024-08-09; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae093
Alexander J Ritter, Andrew Wallace, Neda Ronaghi, Jeremy R Sanford
Alternative splicing (AS) is emerging as an important regulatory process for complex biological processes. Transcriptomic studies therefore commonly involve the identification and quantification of alternative processing events, but the need for predicting the functional consequences of changes to the relative inclusion of alternative events remains largely unaddressed. Many tools exist for the former task, albeit each constrained to its own event type definitions. Few tools exist for the latter task, each with significant limitations. To address these issues, we developed junctionCounts, which captures both simple and complex pairwise AS events and quantifies them with straightforward exon-exon and exon-intron junction reads in RNA-seq data, performing competitively among similar tools in terms of sensitivity, false discovery rate and quantification accuracy. Its partner utility, cdsInsertion, identifies transcript coding sequence (CDS) information via in silico translation from annotated start codons, including the presence of premature termination codons. Finally, findSwitchEvents connects AS events with CDS information to predict the impact of individual events on the isoform-level CDS. We used junctionCounts to characterize splicing dynamics and NMD regulation during neuronal differentiation across four primates, demonstrating junctionCounts' capacity to robustly characterize AS in a variety of organisms and to predict its effect on mRNA isoform fate.
Title: junctionCounts: comprehensive alternative splicing analysis and prediction of isoform-level impacts to the coding sequence. Journal: NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae093. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11310779/pdf/
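Quantifying a pairwise AS event from junction reads usually comes down to a percent-spliced-in (PSI) value, the share of junction reads that support the included form. The sketch below shows that generic calculation under simple assumptions: counts are already summarized per event, and events with too few reads are left unquantified. It is not the junctionCounts implementation, which may aggregate and filter junctions differently; the function name, the `min_reads` threshold and the counts are hypothetical.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads, min_reads=10):
    """PSI = inclusion / (inclusion + exclusion), or None if coverage is low."""
    if inclusion_reads < 0 or exclusion_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = inclusion_reads + exclusion_reads
    if total < min_reads:
        return None
    return inclusion_reads / total


# Hypothetical junction read counts for one event in two conditions.
psi_progenitor = percent_spliced_in(30, 10)
psi_neuron = percent_spliced_in(10, 30)
delta_psi = psi_neuron - psi_progenitor

# An event with too few junction reads is not quantified.
psi_low_coverage = percent_spliced_in(3, 2)
```

The difference in PSI between two conditions is the quantity that a downstream step would then connect to coding-sequence changes such as a premature termination codon.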
Pub Date: 2024-08-09; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae089
Veerendra P Gadekar, Alexander Welford Munk, Milad Miladi, Alexander Junge, Rolf Backofen, Stefan E Seemann, Jan Gorodkin
RNA secondary structures play essential roles in the formation of the tertiary structure and function of a transcript. Recent genome-wide studies highlight significant potential for RNA structures in the mammalian genome. However, a major challenge is assigning functional roles to these structured RNAs. In this study, we conduct a guilt-by-association analysis of clusters of computationally predicted conserved RNA structures (CRSs) in human untranslated regions (UTRs) to associate them with gene functions. We filtered a broad pool of ∼500 000 human CRSs for UTR overlap, resulting in 4734 and 24 754 CRSs from the 5' and 3' UTRs of protein-coding genes, respectively. We clustered the CRSs of each set separately using RNAscClust, obtaining 793 and 2403 clusters with an average of five CRSs per cluster. We identified overrepresented binding sites for 60 and 43 RNA-binding proteins co-localizing with the clustered CRSs. Furthermore, 104 and 441 clusters from the 5' and 3' UTRs, respectively, showed enrichment for various Gene Ontology terms, including biological processes such as 'signal transduction' and 'nervous system development', molecular functions such as 'transferase activity', and cellular components such as 'synapse', among others. Our study shows that significant functional insights can be gained by clustering RNA structures based on their structural characteristics.
Title: Clusters of mammalian conserved RNA structures in UTRs associate with RBP binding sites. Journal: NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae089. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11310781/pdf/
Pub Date: 2024-08-06; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae085
Adam M B Allen, Anthony Maxwell
DNA topoisomerases (topos) are major targets for antimicrobial and chemotherapeutic drugs due to their fundamental roles in regulating DNA topology. Type II topos are essential for chromosome segregation and relaxing positive DNA supercoils, and are exemplified by topo II in eukaryotes, topo IV and DNA gyrase in bacteria, and topo VI in archaea. Topo VI occurs ubiquitously in plants and sporadically in bacteria, algae, and other protists and is highly homologous to Spo11, which initiates eukaryotic homologous recombination. This homology makes the two complexes difficult to distinguish by sequence and leads to discrepancies such as the identity of the putative topo VI in malarial Plasmodium species. A lack of understanding of the role and distribution of topo VI outside of archaea hampers its pursuit as a potential drug target, and the present study addresses this with an up-to-date and extensive phylogenetic analysis. We show that the A and B subunits of topo VI and Spo11 can be distinguished using phylogenetics and structural modelling, and that topo VI is not present in Plasmodium or in other members of the phylum Apicomplexa. These findings provide insights into the evolutionary relationships between topo VI and Spo11, and their adoption alongside other type II topos.
{"title":"Phylogenetic distribution of DNA topoisomerase VI and its distinction from SPO11.","authors":"Adam M B Allen, Anthony Maxwell","doi":"10.1093/nargab/lqae085","DOIUrl":"10.1093/nargab/lqae085","url":null,"abstract":"<p><p>DNA topoisomerases (topos) are major targets for antimicrobial and chemotherapeutic drugs due to their fundamental roles in regulating DNA topology. Type II topos are essential for chromosome segregation and relaxing positive DNA supercoils, and are exemplified by topo II in eukaryotes, topo IV and DNA gyrase in bacteria, and topo VI in archaea. Topo VI occurs ubiquitously in plants and sporadically in bacteria, algae, and other protists and is highly homologous to Spo11, which initiates eukaryotic homologous recombination. This homology makes the two complexes difficult to distinguish by sequence and leads to discrepancies such as the identity of the putative topo VI in malarial <i>Plasmodium</i> species. A lack of understanding of the role and distribution of topo VI outside of archaea hampers its pursuit as a potential drug target, and the present study addresses this with an up-to-date and extensive phylogenetic analysis. We show that the A and B subunits of topo VI and Spo11 can be distinguished using phylogenetics and structural modelling, and that topo VI is not present in <i>Plasmodium</i> nor other members of the phylum Apicomplexa. 
These findings provide insights into the evolutionary relationships between topo VI and Spo11, and their adoption alongside other type II topos.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae085"},"PeriodicalIF":4.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302465/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141898521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-06; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae094
Chenguang Zhao, Tong Liu, Zheng Wang
Previous protein function predictors primarily make predictions from amino acid sequences instead of tertiary structures because of the limited number of experimentally determined structures and the unsatisfactory quality of predicted structures. AlphaFold recently achieved promising performance in predicting protein tertiary structures, and the AlphaFold protein structure database (AlphaFold DB) is expanding fast. Therefore, we aimed to develop a deep-learning tool that is trained specifically with AlphaFold models and predicts gene ontology (GO) terms from them. We developed an advanced learning architecture by combining geometric vector perceptron graph neural networks and variant transformer decoder layers for multi-label classification. PANDA-3D predicts GO terms from the predicted structures of AlphaFold and the embeddings of amino acid sequences based on a large language model. Our method significantly outperformed a state-of-the-art deep-learning method that was trained with experimentally determined tertiary structures, and either outperformed or was comparable with several other language-model-based state-of-the-art methods that take amino acid sequences as input. PANDA-3D is tailored to AlphaFold models, and the AlphaFold DB currently contains over 200 million predicted protein structures (as of May 1st, 2023), making PANDA-3D a useful tool that can accurately annotate the functions of a large number of proteins. PANDA-3D can be freely accessed as a web server from http://dna.cs.miami.edu/PANDA-3D/ and as a repository from https://github.com/zwang-bioinformatics/PANDA-3D.
{"title":"PANDA-3D: protein function prediction based on AlphaFold models.","authors":"Chenguang Zhao, Tong Liu, Zheng Wang","doi":"10.1093/nargab/lqae094","DOIUrl":"10.1093/nargab/lqae094","url":null,"abstract":"<p><p>Previous protein function predictors primarily make predictions from amino acid sequences instead of tertiary structures because of the limited number of experimentally determined structures and the unsatisfying qualities of predicted structures. AlphaFold recently achieved promising performances when predicting protein tertiary structures, and the AlphaFold protein structure database (AlphaFold DB) is fast-expanding. Therefore, we aimed to develop a deep-learning tool that is specifically trained with AlphaFold models and predict GO terms from AlphaFold models. We developed an advanced learning architecture by combining geometric vector perceptron graph neural networks and variant transformer decoder layers for multi-label classification. PANDA-3D predicts gene ontology (GO) terms from the predicted structures of AlphaFold and the embeddings of amino acid sequences based on a large language model. Our method significantly outperformed a state-of-the-art deep-learning method that was trained with experimentally determined tertiary structures, and either outperformed or was comparable with several other language-model-based state-of-the-art methods with amino acid sequences as input. PANDA-3D is tailored to AlphaFold models, and the AlphaFold DB currently contains over 200 million predicted protein structures (as of May 1st, 2023), making PANDA-3D a useful tool that can accurately annotate the functions of a large number of proteins. 
PANDA-3D can be freely accessed as a web server from http://dna.cs.miami.edu/PANDA-3D/ and as a repository from https://github.com/zwang-bioinformatics/PANDA-3D.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae094"},"PeriodicalIF":4.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302463/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141898488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}