NAR Genomics and Bioinformatics: latest articles
Bilingual language model for protein sequence and structure.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-15 eCollection Date: 2024-12-01 DOI: 10.1093/nargab/lqae150
Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Milot Mirdita, Martin Steinegger, Burkhard Rost

Adapting language models to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities in a single model. We encode protein structures as token sequences using the 3Di-alphabet introduced by the 3D-alignment method Foldseek. For training, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein 'structure-sequence' T5 (ProstT5), we showed improved performance for subsequent, structure-related prediction tasks, leading to three orders of magnitude speedup for deriving 3Di. This will be crucial for future applications trying to search metagenomic sequence databases at the sensitivity of structure comparisons. Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions and opens new research avenues in the post-AlphaFold2 era.

Journal Article. NAR Genomics and Bioinformatics, vol. 6, no. 4, lqae150. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616678/pdf/
Citations: 0
Correction to 'NFixDB (Nitrogen Fixation DataBase)-a comprehensive integrated database for robust 'omics analysis of diazotrophs'.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-15 eCollection Date: 2024-12-01 DOI: 10.1093/nargab/lqae164

[This corrects the article DOI: 10.1093/nar/lqae063.].

Journal Article. NAR Genomics and Bioinformatics, vol. 6, no. 4, lqae164. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616680/pdf/
Citations: 0
Pymportx: facilitating next-generation transcriptomics analysis in Python.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-15 eCollection Date: 2024-12-01 DOI: 10.1093/nargab/lqae160
Paula Pena González, Dafne Lozano-Paredes, José Luis Rojo-Álvarez, Luis Bote-Curiel, Víctor Javier Sánchez-Arévalo Lobo

The efficient importation of quantified gene expression data is pivotal in transcriptomics. Historically, the R package Tximport addressed this need by enabling seamless data integration from various quantification tools. However, the Python community lacked a corresponding tool, restricting cross-platform bioinformatics interoperability. We introduce Pymportx, a Python adaptation of Tximport, which replicates and extends the original package's functionalities. Pymportx maintains the integrity and accuracy of gene expression data while improving processing speed and integration within the Python ecosystem. It supports new data formats and includes tools for enhanced data exploration and analysis. Available under the MIT license, Pymportx integrates smoothly with Python's bioinformatics tools, facilitating a unified and efficient workflow across the R and Python ecosystems. This advancement not only broadens access to Python's extensive toolset but also fosters interdisciplinary collaboration and the development of cutting-edge bioinformatics analyses.

Journal Article. NAR Genomics and Bioinformatics, vol. 6, no. 4, lqae160. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616679/pdf/
Citations: 0
A general kernel machine regression framework using principal component analysis for jointly testing main and interaction effects: Applications to human microbiome studies.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-12 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae148
Hyunwook Koh

The effect of a treatment on a health or disease response can be modified by genetic or microbial variants. It is the matter of interaction effects between genetic or microbial variants and a treatment. To powerfully discover genetic or microbial biomarkers, it is crucial to incorporate such interaction effects in addition to the main effects. However, in the context of kernel machine regression analysis of its kind, existing methods cannot be utilized in a situation, where a kernel is available but its underlying real variants are unknown. To address such limitations, I introduce a general kernel machine regression framework using principal component analysis for jointly testing main and interaction effects. It begins with extracting principal components from an input kernel through the singular value decomposition. Then, it employs the principal components as surrogate variants to construct three endogenous kernels for the main effects, interaction effects, and both of them, respectively. Hence, it works with a kernel as an input without knowing its underlying real variants, and also detects either the main effects, interaction effects, or both of them robustly. I also introduce its omnibus testing extension to multiple input kernels, named OmniK. I demonstrate its use for human microbiome studies.
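
The steps named in this abstract (principal components extracted from an input kernel by singular value decomposition, then reused as surrogate variants) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `surrogate_kernels`, the tolerance `tol`, and the choice to form the interaction kernel from treatment-by-component products are all hypothetical.

```python
import numpy as np

def surrogate_kernels(K, treatment, tol=1e-10):
    """Hypothetical sketch: derive surrogate variants from a kernel and
    build main, interaction, and joint kernels from them."""
    K = np.asarray(K, dtype=float)
    t = np.asarray(treatment, dtype=float)

    # For a symmetric positive semi-definite kernel, K = U diag(s) U^T.
    U, s, _ = np.linalg.svd(K)

    # Keep components with non-negligible singular values.
    keep = s > tol * s.max()

    # Principal components used as surrogate variants: Z @ Z.T ~ K.
    Z = U[:, keep] * np.sqrt(s[keep])

    # Assumed interaction surrogates: each component scaled by treatment.
    ZT = t[:, None] * Z

    K_main = Z @ Z.T          # main effects
    K_int = ZT @ ZT.T         # interaction effects
    K_both = K_main + K_int   # main and interaction effects jointly
    return Z, K_main, K_int, K_both
```

Under these assumptions the main-effects kernel reproduces the input kernel, and the interaction kernel equals the element-wise product of the input kernel with the outer product of the treatment vector, so the underlying real variants are never needed.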

Journal Article. NAR Genomics and Bioinformatics, vol. 6, no. 4, lqae148. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11555437/pdf/
Citations: 0
Refining SARS-CoV-2 intra-host variation by leveraging large-scale sequencing data.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-12 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae145
Fatima Mostefai, Jean-Christophe Grenier, Raphaël Poujol, Julie Hussin

Understanding viral genome evolution during host infection is crucial for grasping viral diversity and evolution. Analyzing intra-host single nucleotide variants (iSNVs) offers insights into new lineage emergence, which is important for predicting and mitigating future viral threats. Despite next-generation sequencing's potential, challenges persist, notably sequencing artifacts leading to false iSNVs. We developed a workflow to enhance iSNV detection in large NGS libraries, using over 130 000 SARS-CoV-2 libraries to distinguish mutations from errors. Our approach integrates bioinformatics protocols, stringent quality control, and dimensionality reduction to tackle batch effects and improve mutation detection reliability. Additionally, we pioneer the application of the PHATE visualization approach to genomic data and introduce a methodology that quantifies how related groups of data points are represented within a two-dimensional space, enhancing clustering structure explanation based on genetic similarities. This workflow advances accurate intra-host mutation detection, facilitating a deeper understanding of viral diversity and evolution.

Journal Article. NAR Genomics and Bioinformatics, vol. 6, no. 4, lqae145. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11555433/pdf/
Citations: 0
Comparative single-cell transcriptomic analysis reveals putative differentiation drivers and potential origin of vertebrate retina.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-12 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae149
Xin Zeng, Fuki Gyoja, Yang Cui, Martin Loza, Takehiro G Kusakabe, Kenta Nakai

Despite known single-cell expression profiles in vertebrate retinas, understanding of their developmental and evolutionary expression patterns among homologous cell classes remains limited. We examined and compared approximately 240 000 retinal cells from four species and found significant similarities among homologous cell classes, indicating inherent regulatory patterns. To understand these shared patterns, we constructed gene regulatory networks for each developmental stage for three of these species. We identified 690 regulons governed by 530 regulators across three species, along with 10 common cell class-specific regulators and 16 highly preserved regulons. RNA velocity analysis pinpointed conserved putative driver genes and regulators to retinal cell differentiation in both mouse and zebrafish. Investigation of the origins of retinal cells by examining conserved expression patterns between vertebrate retinal cells and invertebrate Ciona intestinalis photoreceptor-related cells implied functional similarities in light transduction mechanisms. Our findings offer insights into the evolutionarily conserved regulatory frameworks and differentiation drivers of vertebrate retinal cells.

Journal Article. NAR Genomics and Bioinformatics, vol. 6, no. 4, lqae149. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11555436/pdf/
Citations: 0
Diverse intrinsic properties shape transcript stability and stabilization in Mycolicibacterium smegmatis.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-04 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae147
Huaming Sun, Diego A Vargas-Blanco, Ying Zhou, Catherine S Masiello, Jessica M Kelly, Justin K Moy, Dmitry Korkin, Scarlet S Shell

Mycobacteria regulate transcript degradation to facilitate adaptation to environmental stress. However, the mechanisms underlying this regulation are unknown. Here we sought to gain understanding of the mechanisms controlling mRNA stability by investigating the transcript properties associated with variance in transcript stability and stress-induced transcript stabilization. We measured mRNA half-lives transcriptome-wide in Mycolicibacterium smegmatis in log phase growth and hypoxia-induced growth arrest. The transcriptome was globally stabilized in response to hypoxia, but transcripts of essential genes were generally stabilized more than those of non-essential genes. We then developed machine learning models that enabled us to identify the non-linear collective effect of a compendium of transcript properties on transcript stability and stabilization. We identified properties that were more predictive of half-life in log phase as well as properties that were more predictive in hypoxia, and many of these varied between leadered and leaderless transcripts. In summary, we found that transcript properties are differentially associated with transcript stability depending on both the transcript type and the growth condition. Our results reveal the complex interplay between transcript features and microenvironment that shapes transcript stability in mycobacteria.

Journal Article. NAR Genomics and Bioinformatics, vol. 6, no. 4, lqae147. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11532794/pdf/
Citations: 0
Faster and more accurate assessment of differential transcript expression with Gibbs sampling and edgeR v4.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-04 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae151
Pedro L Baldoni, Lizhong Chen, Gordon K Smyth

This article further develops edgeR's divided-count approach for differential transcript expression (DTE) analysis of RNA-seq data to produce a faster and more accurate pipeline. The divided-count approach models the precision of transcript quantifications from the kallisto and Salmon software tools and divides the estimated overdispersions out of the transcript read counts, after which the divided-counts can be analysed by statistical tools developed for gene-level counts. This article adds three new refinements to the pipeline that dramatically decrease the computational overhead and storage requirements so that DTE analysis of very large datasets becomes practical. The new pipeline replaces bootstrap with Gibbs resampling and replaces edgeR v3 with v4. Both of these changes improve statistical power and accuracy and provide better resolution for low-count transcripts. The accuracy of overdispersion estimation is shown to depend on the total number of resamples across the whole dataset rather than on individual samples, dramatically reducing the recommended number of technical samples for large datasets. Test data and extensive simulations data show that the new pipeline is more powerful and efficient than previous DTE pipelines while providing correct control of the false discovery rate for any sample size.
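
The divided-count idea described in this abstract can be illustrated with a small sketch. It is a simplified stand-in under stated assumptions, not edgeR's code: the function names `overdispersion` and `divided_counts` are hypothetical, the estimate is a plain pooled variance-to-mean ratio over resampled counts, and any moderation or shrinkage step is omitted.

```python
def overdispersion(resamples_per_sample):
    """Hypothetical pooled overdispersion estimate for one transcript.

    resamples_per_sample holds one list of resampled counts (bootstrap or
    Gibbs draws) per sample. Squared deviations are scaled by the mean and
    pooled over all samples, so the degrees of freedom grow with the total
    number of resamples in the dataset.
    """
    stat = 0.0
    df = 0
    for draws in resamples_per_sample:
        mean = sum(draws) / len(draws)
        if mean > 0:
            stat += sum((x - mean) ** 2 for x in draws) / mean
            df += len(draws) - 1
    if df == 0:
        return 1.0  # no information: leave counts unchanged
    # An overdispersion below 1 would inflate counts, so floor it at 1.
    return max(1.0, stat / df)


def divided_counts(counts, od):
    """Divide the estimated overdispersion out of the read counts."""
    return [c / od for c in counts]
```

For example, `overdispersion([[4, 16, 10], [10, 30, 20]])` pools four degrees of freedom from two samples with three draws each. The resulting divided counts can then be passed to tools designed for gene-level counts.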

Journal Article. NAR Genomics and Bioinformatics, vol. 6, no. 4, lqae151. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11532793/pdf/
Citations: 0
Tumor purity estimated from bulk DNA methylation can be used for adjusting beta values of individual samples to better reflect tumor biology.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-11-04 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae146
Iñaki Sasiain, Deborah F Nacer, Mattias Aine, Srinivas Veerla, Johan Staaf

Epigenetic deregulation through altered DNA methylation is a fundamental feature of tumorigenesis, but tumor data from bulk tissue samples contain different proportions of malignant and non-malignant cells that may confound the interpretation of DNA methylation values. The adjustment of DNA methylation data based on tumor purity has been proposed to render both genome-wide and gene-specific analyses more precise, but it requires sample purity estimates. Here we present PureBeta, a single-sample statistical framework that uses genome-wide DNA methylation data to first estimate sample purity and then adjust methylation values of individual CpGs to correct for sample impurity. Purity values estimated with the algorithm have high correlation (>0.8) to reference values obtained from DNA sequencing when applied to samples from breast carcinoma, lung adenocarcinoma, and lung squamous cell carcinoma. Methylation beta values adjusted based on purity estimates have a more binary distribution that better reflects theoretical methylation states, thus facilitating improved biological inference as shown for BRCA1 in breast cancer. PureBeta is a versatile tool that can be used for different Illumina DNA methylation arrays and can be applied to individual samples of different cancer types to enhance biological interpretability of methylation data.
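
To make the idea of purity adjustment concrete, here is a minimal sketch based on a simple two-component linear mixture. It is an assumption made for illustration and not the PureBeta algorithm: the function name `adjust_beta`, the `normal_beta` input (an assumed methylation level for the non-malignant fraction), and the clipping to [0, 1] are all hypothetical.

```python
def adjust_beta(observed, purity, normal_beta):
    """Hypothetical purity correction for a single CpG.

    Assumes observed = purity * tumor + (1 - purity) * normal and solves
    for the tumor component, clipping the result to the valid beta range.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    tumor = (observed - (1.0 - purity) * normal_beta) / purity
    return min(1.0, max(0.0, tumor))
```

With `observed=0.5`, `purity=0.5`, and `normal_beta=0.1`, the corrected value moves toward the methylated end of the scale, which matches the abstract's point that adjusted beta values tend toward a more binary distribution.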

NAR Genomics and Bioinformatics, 6(4), lqae146. Journal Article. Published 2024-11-04. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11532792/pdf/
Citations: 0
Intronic RNA secondary structural information captured for the human MYC pre-mRNA.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-10-24 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae143
Taylor O Eich, Collin A O'Leary, Walter N Moss

To address the lack of intronic reads in secondary structure probing data for the human MYC pre-mRNA, we developed a method that combines spliceosomal inhibition with RNA probing and sequencing. Here, the SIRP-seq method was applied to study the secondary structure of human MYC RNAs by chemically probing HeLa cells with dimethyl sulfate in the presence of the small molecule spliceosome inhibitor pladienolide B. Pladienolide B binds to the SF3B complex of the spliceosome to inhibit intron removal during splicing, resulting in retained intronic sequences. This method was used to increase the read coverage over intronic regions of MYC. The purpose for increasing coverage across introns was to generate complete reactivity profiles for intronic sequences via the DMS-MaPseq approach. Notably, depth was sufficient for analysis by the program DRACO, which was able to deduce distinct reactivity profiles and predict multiple secondary structural conformations as well as their suggested stoichiometric abundances. The results presented here provide a new method for intronic RNA secondary structural analyses, as well as specific structural insights relevant to MYC RNA splicing regulation and therapeutic targeting.
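The link between read coverage and a complete reactivity profile can be illustrated with a short sketch. The abstract does not give the computation. The sketch assumes the common DMS-MaPseq convention of summarizing reactivity as a per-nucleotide mutation rate (mismatches divided by read depth) and leaves positions with too few reads undefined. The function name `mutation_rates` and the `min_coverage` threshold of 1000 reads are hypothetical choices for illustration.

```python
from typing import List, Optional


def mutation_rates(
    mutations: List[int],
    coverage: List[int],
    min_coverage: int = 1000,  # hypothetical depth cutoff, not from the paper
) -> List[Optional[float]]:
    """Per-position mutation rate, or None where read depth is too low."""
    if len(mutations) != len(coverage):
        raise ValueError("mutations and coverage must have the same length")
    rates: List[Optional[float]] = []
    for mismatches, depth in zip(mutations, coverage):
        if depth >= min_coverage:
            rates.append(mismatches / depth)
        else:
            # Too few reads to estimate a rate: the profile has a gap here.
            rates.append(None)
    return rates


# Three positions: well covered, poorly covered, well covered and unmodified.
profile = mutation_rates([50, 2, 0], [2000, 100, 4000])
```

In this toy input the middle position has only 100 reads and therefore no usable value, which mirrors the gap that poorly covered intronic regions would leave in a profile.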

NAR Genomics and Bioinformatics, 6(4), lqae143. Journal Article. Published 2024-10-24. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11500451/pdf/
Citations: 0