Pub Date: 2024-11-15; eCollection Date: 2024-12-01; DOI: 10.1093/nargab/lqae150
Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Milot Mirdita, Martin Steinegger, Burkhard Rost
Adapting language models to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities in a single model. We encode protein structures as token sequences using the 3Di-alphabet introduced by the 3D-alignment method Foldseek. For training, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein 'structure-sequence' T5 (ProstT5), we showed improved performance for subsequent, structure-related prediction tasks, leading to three orders of magnitude speedup for deriving 3Di. This will be crucial for future applications trying to search metagenomic sequence databases at the sensitivity of structure comparisons. Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions and opens new research avenues in the post-AlphaFold2 era.
Title: Bilingual language model for protein sequence and structure. Journal: NAR Genomics and Bioinformatics 6(4), lqae150. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616678/pdf/
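The ProstT5 abstract above describes translating between amino acid sequences and 3Di token sequences with a single model. A minimal sketch of the input formatting such a bilingual model might expect is given below. The prefix tokens `<AA2fold>` and `<fold2AA>`, the upper-case/lower-case convention, and the mapping of rare residues to `X` are recalled from the public model card rather than stated in the abstract, so treat them as assumptions and verify them before use. The function name `prepare_input` is hypothetical.

```python
import re

# Direction prefixes (assumption: taken from memory of the public model card).
AA_TO_3DI = "<AA2fold>"      # amino acids -> 3Di structure tokens
THREEDI_TO_AA = "<fold2AA>"  # 3Di structure tokens -> amino acids


def prepare_input(sequence: str) -> str:
    """Format one sequence for a ProstT5-style translation model.

    Upper-case input is treated as amino acids, lower-case input as 3Di states.
    """
    if sequence.isupper():
        # Map rare or ambiguous residues to X (assumed ProtT5-family convention).
        cleaned = re.sub(r"[UZOB]", "X", sequence)
        prefix = AA_TO_3DI
    elif sequence.islower():
        cleaned = sequence
        prefix = THREEDI_TO_AA
    else:
        raise ValueError("empty or mixed-case input: cannot tell amino acids from 3Di")
    # One token per residue or structural state, separated by spaces.
    return prefix + " " + " ".join(cleaned)
```

The returned string would then be passed to a tokenizer; loading the model itself is left out because it depends on the installed library versions.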
Pub Date: 2024-11-15; eCollection Date: 2024-12-01; DOI: 10.1093/nargab/lqae164
[This corrects the article DOI: 10.1093/nar/lqae063.].
Title: Correction to 'NFixDB (Nitrogen Fixation DataBase)-a comprehensive integrated database for robust 'omics analysis of diazotrophs'. Journal: NAR Genomics and Bioinformatics 6(4), lqae164. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616680/pdf/
Pub Date: 2024-11-15; eCollection Date: 2024-12-01; DOI: 10.1093/nargab/lqae160
Paula Pena González, Dafne Lozano-Paredes, José Luis Rojo-Álvarez, Luis Bote-Curiel, Víctor Javier Sánchez-Arévalo Lobo
The efficient importation of quantified gene expression data is pivotal in transcriptomics. Historically, the R package Tximport addressed this need by enabling seamless data integration from various quantification tools. However, the Python community lacked a corresponding tool, restricting cross-platform bioinformatics interoperability. We introduce Pymportx, a Python adaptation of Tximport, which replicates and extends the original package's functionalities. Pymportx maintains the integrity and accuracy of gene expression data while improving processing speed and integration within the Python ecosystem. It supports new data formats and includes tools for enhanced data exploration and analysis. Available under the MIT license, Pymportx integrates smoothly with Python's bioinformatics tools, facilitating a unified and efficient workflow across the R and Python ecosystems. This advancement not only broadens access to Python's extensive toolset but also fosters interdisciplinary collaboration and the development of cutting-edge bioinformatics analyses.
Title: Pymportx: facilitating next-generation transcriptomics analysis in Python. Journal: NAR Genomics and Bioinformatics 6(4), lqae160. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616679/pdf/
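The Pymportx abstract above concerns importing transcript-level quantifications. One core operation of Tximport-style import is summarising transcript-level estimates to gene level through a transcript-to-gene map. The sketch below illustrates that idea with plain dictionaries. It is not the Pymportx API; the function name `summarize_to_gene` and the data layout are assumptions made for illustration.

```python
from collections import defaultdict


def summarize_to_gene(tx_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level totals.

    tx_counts: dict mapping transcript ID -> estimated count for one sample.
    tx2gene:   dict mapping transcript ID -> gene ID.
    Returns (gene_counts, unmapped), where unmapped lists transcripts
    that have no entry in tx2gene and were therefore skipped.
    """
    gene_counts = defaultdict(float)
    unmapped = []
    for tx_id, count in tx_counts.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is None:
            unmapped.append(tx_id)  # keep for inspection instead of failing
            continue
        gene_counts[gene_id] += count
    return dict(gene_counts), unmapped
```

Returning the unmapped transcripts makes annotation mismatches between the quantification index and the gene map visible rather than silent.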
Pub Date: 2024-11-12; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae148
Hyunwook Koh
The effect of a treatment on a health or disease response can be modified by genetic or microbial variants; that is, the variants and the treatment can interact. To discover genetic or microbial biomarkers with adequate power, it is crucial to model such interaction effects in addition to the main effects. However, existing kernel machine regression methods of this kind cannot be used when a kernel is available but its underlying real variants are unknown. To address this limitation, I introduce a general kernel machine regression framework that uses principal component analysis to jointly test main and interaction effects. It begins by extracting principal components from an input kernel through singular value decomposition. It then uses the principal components as surrogate variants to construct three endogenous kernels: one for the main effects, one for the interaction effects, and one for both. Hence, it works with a kernel as input without knowledge of the underlying real variants, and it robustly detects main effects, interaction effects, or both. I also introduce an omnibus testing extension to multiple input kernels, named OmniK, and demonstrate its use in human microbiome studies.
Title: A general kernel machine regression framework using principal component analysis for jointly testing main and interaction effects: Applications to human microbiome studies. Journal: NAR Genomics and Bioinformatics 6(4), lqae148. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11555437/pdf/
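The kernel machine abstract above outlines a two-step construction: extract principal components from an input kernel by singular value decomposition, then use them as surrogate variants to build kernels for main effects, interaction effects, and both. A minimal NumPy sketch of one way to do this follows. The scaling of components by the square root of the singular values, the elementwise product with the treatment vector, and the unweighted sum for the joint kernel are assumptions for illustration; the abstract does not give these details, and the function names are hypothetical.

```python
import numpy as np


def surrogate_variants(K, tol=1e-10):
    """Extract principal components from a symmetric PSD kernel K (n x n).

    Returns Z (n x r) such that Z @ Z.T reproduces K up to rounding.
    """
    U, s, _ = np.linalg.svd(K)
    keep = s > tol * s.max()          # drop numerically null directions
    return U[:, keep] * np.sqrt(s[keep])


def build_kernels(K, treatment):
    """Build main-effect, interaction, and joint kernels from an input kernel."""
    Z = surrogate_variants(K)
    t = np.asarray(treatment, dtype=float).reshape(-1, 1)
    ZT = Z * t                        # surrogate-by-treatment interaction terms
    K_main = Z @ Z.T
    K_int = ZT @ ZT.T
    K_joint = K_main + K_int          # unweighted sum (assumption)
    return K_main, K_int, K_joint
```

Under these assumptions the interaction kernel equals the input kernel multiplied elementwise by the outer product of the treatment vector with itself, which is why the construction never needs the original variants.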
Understanding viral genome evolution during host infection is crucial for grasping viral diversity and evolution. Analyzing intra-host single nucleotide variants (iSNVs) offers insights into new lineage emergence, which is important for predicting and mitigating future viral threats. Despite next-generation sequencing's potential, challenges persist, notably sequencing artifacts leading to false iSNVs. We developed a workflow to enhance iSNV detection in large NGS libraries, using over 130 000 SARS-CoV-2 libraries to distinguish mutations from errors. Our approach integrates bioinformatics protocols, stringent quality control, and dimensionality reduction to tackle batch effects and improve mutation detection reliability. Additionally, we pioneer the application of the PHATE visualization approach to genomic data and introduce a methodology that quantifies how related groups of data points are represented within a two-dimensional space, enhancing clustering structure explanation based on genetic similarities. This workflow advances accurate intra-host mutation detection, facilitating a deeper understanding of viral diversity and evolution.
Title: Refining SARS-CoV-2 intra-host variation by leveraging large-scale sequencing data. Authors: Fatima Mostefai, Jean-Christophe Grenier, Raphaël Poujol, Julie Hussin. Pub Date: 2024-11-12; DOI: 10.1093/nargab/lqae145. Journal: NAR Genomics and Bioinformatics 6(4), lqae145. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11555433/pdf/
Pub Date: 2024-11-12; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae149
Xin Zeng, Fuki Gyoja, Yang Cui, Martin Loza, Takehiro G Kusakabe, Kenta Nakai
Despite known single-cell expression profiles in vertebrate retinas, understanding of their developmental and evolutionary expression patterns among homologous cell classes remains limited. We examined and compared approximately 240 000 retinal cells from four species and found significant similarities among homologous cell classes, indicating inherent regulatory patterns. To understand these shared patterns, we constructed gene regulatory networks for each developmental stage for three of these species. We identified 690 regulons governed by 530 regulators across three species, along with 10 common cell class-specific regulators and 16 highly preserved regulons. RNA velocity analysis pinpointed conserved putative driver genes and regulators to retinal cell differentiation in both mouse and zebrafish. Investigation of the origins of retinal cells by examining conserved expression patterns between vertebrate retinal cells and invertebrate Ciona intestinalis photoreceptor-related cells implied functional similarities in light transduction mechanisms. Our findings offer insights into the evolutionarily conserved regulatory frameworks and differentiation drivers of vertebrate retinal cells.
Title: Comparative single-cell transcriptomic analysis reveals putative differentiation drivers and potential origin of vertebrate retina. Journal: NAR Genomics and Bioinformatics 6(4), lqae149. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11555436/pdf/
Pub Date: 2024-11-04; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae147
Huaming Sun, Diego A Vargas-Blanco, Ying Zhou, Catherine S Masiello, Jessica M Kelly, Justin K Moy, Dmitry Korkin, Scarlet S Shell
Mycobacteria regulate transcript degradation to facilitate adaptation to environmental stress. However, the mechanisms underlying this regulation are unknown. Here we sought to gain understanding of the mechanisms controlling mRNA stability by investigating the transcript properties associated with variance in transcript stability and stress-induced transcript stabilization. We measured mRNA half-lives transcriptome-wide in Mycolicibacterium smegmatis in log phase growth and hypoxia-induced growth arrest. The transcriptome was globally stabilized in response to hypoxia, but transcripts of essential genes were generally stabilized more than those of non-essential genes. We then developed machine learning models that enabled us to identify the non-linear collective effect of a compendium of transcript properties on transcript stability and stabilization. We identified properties that were more predictive of half-life in log phase as well as properties that were more predictive in hypoxia, and many of these varied between leadered and leaderless transcripts. In summary, we found that transcript properties are differentially associated with transcript stability depending on both the transcript type and the growth condition. Our results reveal the complex interplay between transcript features and microenvironment that shapes transcript stability in mycobacteria.
Title: Diverse intrinsic properties shape transcript stability and stabilization in Mycolicibacterium smegmatis. Journal: NAR Genomics and Bioinformatics 6(4), lqae147. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11532794/pdf/
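The transcript stability abstract above relies on mRNA half-lives measured transcriptome-wide. The abstract does not say how the half-lives were computed, so the sketch below uses the textbook first-order decay model as an assumption: abundance falls as A(t) = A0 * exp(-k t), and the half-life is ln(2) / k. The function name `half_life` is hypothetical.

```python
import math


def half_life(times, abundances):
    """Estimate a half-life from a time course, assuming first-order decay.

    Fits ln(abundance) = ln(A0) - k * t by ordinary least squares and
    returns ln(2) / k in the same unit as `times`.
    """
    if len(times) != len(abundances) or len(times) < 2:
        raise ValueError("need at least two matched time points")
    logs = [math.log(a) for a in abundances]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    k = -sxy / sxx                    # decay rate is the negated slope
    if k <= 0:
        raise ValueError("no decay detected in this time course")
    return math.log(2) / k
```

A transcript that is stabilized under stress would show a smaller fitted k, and therefore a longer half-life, than the same transcript in log phase.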
Pub Date: 2024-11-04; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae151
Pedro L Baldoni, Lizhong Chen, Gordon K Smyth
This article further develops edgeR's divided-count approach for differential transcript expression (DTE) analysis of RNA-seq data to produce a faster and more accurate pipeline. The divided-count approach models the precision of transcript quantifications from the kallisto and Salmon software tools and divides the estimated overdispersions out of the transcript read counts, after which the divided-counts can be analysed by statistical tools developed for gene-level counts. This article adds three new refinements to the pipeline that dramatically decrease the computational overhead and storage requirements so that DTE analysis of very large datasets becomes practical. The new pipeline replaces bootstrap with Gibbs resampling and replaces edgeR v3 with v4. Both of these changes improve statistical power and accuracy and provide better resolution for low-count transcripts. The accuracy of overdispersion estimation is shown to depend on the total number of resamples across the whole dataset rather than on individual samples, dramatically reducing the recommended number of technical samples for large datasets. Test data and extensive simulations data show that the new pipeline is more powerful and efficient than previous DTE pipelines while providing correct control of the false discovery rate for any sample size.
Title: Faster and more accurate assessment of differential transcript expression with Gibbs sampling and edgeR v4. Journal: NAR Genomics and Bioinformatics 6(4), lqae151. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11532793/pdf/
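The divided-count abstract above describes estimating an overdispersion for each transcript from technical resamples and dividing it out of the read counts. A simplified sketch of that idea for a single transcript is given below. It pools a variance-to-mean statistic over all resamples from all samples, which mirrors the abstract's point that accuracy depends on the total number of resamples across the dataset. It is written in Python for illustration, is not edgeR code, and omits any moderation or prior that the published pipeline may apply; the function names and the floor at 1 are assumptions.

```python
def overdispersion(resamples_by_sample):
    """Pooled overdispersion estimate for one transcript.

    resamples_by_sample: one list of resampled counts (bootstrap or Gibbs)
    per biological sample.
    """
    stat = 0.0
    df = 0
    for draws in resamples_by_sample:
        if len(draws) < 2:
            continue                  # a single draw carries no variance information
        mean = sum(draws) / len(draws)
        if mean <= 0:
            continue                  # all-zero resamples are uninformative
        stat += sum((x - mean) ** 2 for x in draws) / mean
        df += len(draws) - 1
    if df == 0:
        return 1.0
    return max(stat / df, 1.0)        # floor at 1 so counts are never inflated


def divided_counts(counts, resamples_by_sample):
    """Divide one transcript's counts (one per sample) by its overdispersion."""
    d = overdispersion(resamples_by_sample)
    return [c / d for c in counts]
```

Because the degrees of freedom accumulate across samples, a dataset with many samples can reach the same total with only a few resamples per sample.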
Pub Date: 2024-11-04; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae146
Iñaki Sasiain, Deborah F Nacer, Mattias Aine, Srinivas Veerla, Johan Staaf
Epigenetic deregulation through altered DNA methylation is a fundamental feature of tumorigenesis, but tumor data from bulk tissue samples contain different proportions of malignant and non-malignant cells that may confound the interpretation of DNA methylation values. The adjustment of DNA methylation data based on tumor purity has been proposed to render both genome-wide and gene-specific analyses more precise, but it requires sample purity estimates. Here we present PureBeta, a single-sample statistical framework that uses genome-wide DNA methylation data to first estimate sample purity and then adjust methylation values of individual CpGs to correct for sample impurity. Purity values estimated with the algorithm have high correlation (>0.8) to reference values obtained from DNA sequencing when applied to samples from breast carcinoma, lung adenocarcinoma, and lung squamous cell carcinoma. Methylation beta values adjusted based on purity estimates have a more binary distribution that better reflects theoretical methylation states, thus facilitating improved biological inference as shown for BRCA1 in breast cancer. PureBeta is a versatile tool that can be used for different Illumina DNA methylation arrays and can be applied to individual samples of different cancer types to enhance biological interpretability of methylation data.
Title: Tumor purity estimated from bulk DNA methylation can be used for adjusting beta values of individual samples to better reflect tumor biology. Journal: NAR Genomics and Bioinformatics 6(4), lqae146. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11532792/pdf/
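The PureBeta abstract above describes adjusting the methylation beta values of individual CpGs to correct for sample impurity. The abstract does not give the adjustment formula, so the sketch below assumes a simple two-component linear mixture: the observed beta is a purity-weighted average of the tumour-cell and non-malignant-cell betas. The function name `adjust_beta` and the requirement for a known `normal_beta` are assumptions for illustration and do not describe PureBeta's actual procedure.

```python
def adjust_beta(observed, purity, normal_beta):
    """Estimate tumour-cell methylation under a linear mixture model.

    Assumes observed = purity * tumour + (1 - purity) * normal_beta,
    solves for tumour, and clips the result to the valid range [0, 1].
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    tumour = (observed - (1.0 - purity) * normal_beta) / purity
    return min(1.0, max(0.0, tumour))
```

Solving the mixture this way pushes intermediate observed values toward 0 or 1 when the non-malignant background differs from the tumour state, consistent with the more binary distribution the abstract reports after adjustment.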
Pub Date: 2024-10-24; eCollection Date: 2024-09-01; DOI: 10.1093/nargab/lqae143
Taylor O Eich, Collin A O'Leary, Walter N Moss
To address the lack of intronic reads in secondary structure probing data for the human MYC pre-mRNA, we developed a method that combines spliceosomal inhibition with RNA probing and sequencing. Here, the SIRP-seq method was applied to study the secondary structure of human MYC RNAs by chemically probing HeLa cells with dimethyl sulfate in the presence of the small molecule spliceosome inhibitor pladienolide B. Pladienolide B binds to the SF3B complex of the spliceosome to inhibit intron removal during splicing, resulting in retained intronic sequences. This method was used to increase the read coverage over intronic regions of MYC. The purpose for increasing coverage across introns was to generate complete reactivity profiles for intronic sequences via the DMS-MaPseq approach. Notably, depth was sufficient for analysis by the program DRACO, which was able to deduce distinct reactivity profiles and predict multiple secondary structural conformations as well as their suggested stoichiometric abundances. The results presented here provide a new method for intronic RNA secondary structural analyses, as well as specific structural insights relevant to MYC RNA splicing regulation and therapeutic targeting.
Title: Intronic RNA secondary structural information captured for the human MYC pre-mRNA. Journal: NAR Genomics and Bioinformatics 6(4), lqae143. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11500451/pdf/