NAR Genomics and Bioinformatics: Latest Articles

Optimizing hybrid ensemble feature selection strategies for transcriptomic biomarker discovery in complex diseases.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-07-11 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae079
Elsa Claude, Mickaël Leclercq, Patricia Thébault, Arnaud Droit, Raluca Uricaru

Biomedical research takes advantage of omic data, such as transcriptomics, to unravel the complexity of diseases. A conventional strategy identifies transcriptomic biomarkers characterized by expression patterns associated with a phenotype by relying on feature selection approaches. Hybrid ensemble feature selection (HEFS) has become increasingly popular as it ensures robustness of the selected features by performing data and functional perturbations. However, it remains difficult to make the best suited choices at each step when designing such approaches. We conducted an extensive analysis of four possible HEFS scenarios for the identification of Stage IV colorectal, Stage I kidney and lung and Stage III endometrial cancer biomarkers from transcriptomic data. These scenarios investigate the use of two types of feature reduction by filters (differentially expressed genes and variance) conjointly with two types of resampling strategies (repeated holdout by distribution-balanced stratified and random stratified) for downstream feature selection through an aggregation of thousands of wrapped machine learning models. Based on our results, we emphasize the advantages of using HEFS approaches to identify complex disease biomarkers, given their ability to produce generalizable and stable results to both data and functional perturbations. Finally, we highlight critical issues that need to be considered in the design of such strategies.
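
The workflow described in this abstract (a filter step, repeated stratified holdout as data perturbation, several selectors as functional perturbation, then aggregation) can be illustrated with a minimal sketch. It is not the authors' pipeline: the wrapped machine learning models are swapped for two simple univariate scores so that the example needs only the standard library, labels are assumed to be binary (0/1), and every function name and parameter here is hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev, pvariance


def variance_filter(X, keep):
    """Filter step: indices of the `keep` highest-variance features."""
    n_features = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return ranked[:keep]


def stratified_holdout(y, train_frac, rng):
    """Data perturbation: draw the same fraction of samples from each class."""
    train = []
    for label in sorted(set(y)):
        members = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(members)
        train.extend(members[: max(2, int(len(members) * train_frac))])
    return train


def mean_gap(a, b):
    """Selector 1 (assumed stand-in): absolute difference between class means."""
    return abs(mean(a) - mean(b))


def scaled_gap(a, b):
    """Selector 2 (assumed stand-in): mean difference scaled by class spread."""
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9)


def hefs(X, y, n_filter, n_select, n_splits, seed=0):
    """Count how often each feature is selected across splits and selectors."""
    rng = random.Random(seed)
    candidates = variance_filter(X, n_filter)
    votes = Counter()
    for _ in range(n_splits):
        train = stratified_holdout(y, 0.7, rng)
        for selector in (mean_gap, scaled_gap):  # functional perturbation
            def score(j):
                a = [X[i][j] for i in train if y[i] == 0]
                b = [X[i][j] for i in train if y[i] == 1]
                return selector(a, b)
            votes.update(sorted(candidates, key=score, reverse=True)[:n_select])
    return votes
```

Features that collect votes in most splits and from both selectors are the stable candidates; the vote threshold used to call a feature a biomarker is a design choice that this sketch leaves open.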

Citations: 0
Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-07-05 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae073
Nathan J LeRoy, Jason P Smith, Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Donald E Brown, Aidong Zhang, Nathan C Sheffield

Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower-dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach: building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python, is open source, and is available at https://github.com/databio/geniml. Pre-trained models from this work are available for public use on Hugging Face: https://huggingface.co/databio.

Citations: 0
Synteruptor: mining genomic islands for non-classical specialized metabolite gene clusters.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-06-19 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae069
Drago Haas, Matthieu Barba, Cláudia M Vicente, Šarká Nezbedová, Amélie Garénaux, Stéphanie Bury-Moné, Jean-Noël Lorenzi, Laurence Hôtel, Luisa Laureti, Annabelle Thibessard, Géraldine Le Goff, Jamal Ouazzani, Pierre Leblond, Bertrand Aigle, Jean-Luc Pernodet, Olivier Lespinet, Sylvie Lautru

Microbial specialized metabolite biosynthetic gene clusters (SMBGCs) are a formidable source of natural products of pharmaceutical interest. As the volume of available genomic data has grown, very efficient bioinformatic tools for automatic SMBGC detection have been developed. Nevertheless, most of these tools identify SMBGCs based on sequence similarity with enzymes typically involved in specialized metabolism and thus may miss SMBGCs coding for undercharacterized enzymes. Here we present Synteruptor (https://bioi2.i2bc.paris-saclay.fr/synteruptor), a program that identifies genomic islands, known to be enriched in SMBGCs, in the genomes of closely related species. With this tool, we identified an SMBGC in the genome of Streptomyces ambofaciens ATCC23877, undetected by antiSMASH versions prior to antiSMASH 5, and experimentally demonstrated that it directs the biosynthesis of two metabolites, one of which was identified as sphydrofuran. Synteruptor is also a valuable resource for the delineation of individual SMBGCs within antiSMASH regions that may encompass multiple clusters, and for refining the boundaries of these SMBGCs.

Citations: 0
Morphological and dietary changes encoded in the genome of Beroe ovata, a ctenophore-eating ctenophore.
IF 4.6 Q1 GENETICS & HEREDITY Pub Date : 2024-06-18 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae072
Alexandra M Vargas, Melissa B DeBiasse, Lana L Dykes, Allison Edgar, T Danielle Hayes, Daniel J Groso, Leslie S Babonis, Mark Q Martindale, Joseph F Ryan

As the sister group to all other animals, ctenophores (comb jellies) are important for understanding the emergence and diversification of numerous animal traits. Efforts to explore the evolutionary processes that promoted diversification within Ctenophora are hindered by undersampling genomic diversity within this clade. To address this gap, we present the sequence, assembly and initial annotation of the genome of Beroe ovata. Beroe possess unique morphology, behavior, ecology and development. Unlike their generalist carnivorous kin, beroid ctenophores feed exclusively on other ctenophores. Accordingly, our analyses revealed a loss of chitinase, an enzyme critical for the digestion of most non-ctenophore prey, but superfluous for ctenophorivores. Broadly, our genomic analysis revealed that extensive gene loss and changes in gene regulation have shaped the unique biology of B. ovata. Despite the gene losses in B. ovata, our phylogenetic analyses on photosensitive opsins and several early developmental regulatory genes show that these genes are conserved in B. ovata. This additional sampling contributes to a more complete reconstruction of the ctenophore ancestor and points to the need for extensive comparisons within this ancient and diverse clade of animals. To promote further exploration of these data, we present BovaDB (http://ryanlab.whitney.ufl.edu/bovadb/), a portal for the B. ovata genome.

Citations: 0
Comparative analysis and classification of highly divergent mouse rDNA units based on their intergenic spacer (IGS) variability.
IF 4.6 Q1 GENETICS & HEREDITY Pub Date : 2024-06-14 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae070
Jung-Hyun Kim, Ramaiah Nagaraja, Alexey Y Ogurtsov, Vladimir N Noskov, Mikhail Liskovykh, Hee-Sheung Lee, Yutaro Hori, Takehiko Kobayashi, Kent Hunter, David Schlessinger, Natalay Kouprina, Svetlana A Shabalina, Vladimir Larionov

Ribosomal DNA (rDNA) repeat units are organized into tandem clusters in eukaryotic cells. In mice, these clusters are located on at least eight chromosomes and show extensive variation in the number of repeats between mouse genomes. To analyze intra- and inter-genomic variation of mouse rDNA repeats, we selectively isolated 25 individual rDNA units using Transformation-Associated Recombination (TAR) cloning. Long-read sequencing and subsequent comparative sequence analysis revealed that each full-length unit comprises an intergenic spacer (IGS) and a ∼13.4 kb long transcribed region encoding the three rRNAs, but with substantial variability in rDNA unit size, ranging from ∼35 to ∼46 kb. Within the transcribed regions of rDNA units, we found 209 variants, 70 of which are in external transcribed spacers (ETSs); but the rDNA size differences are driven primarily by IGS size heterogeneity, due to indels containing repetitive elements and some functional signals such as enhancers. Further evolutionary analysis categorized rDNA units into distinct clusters with characteristic IGS lengths; numbers of enhancers; and presence/absence of two common SNPs in promoter regions, one of which is located within promoter (p)RNA and may influence pRNA folding stability. These characteristic features of IGSs also correlated significantly with 5'ETS variant patterns described previously and associated with differential expression of rDNA units. Our results suggest that variant rDNA units are differentially regulated and open a route to investigate the role of rDNA variation on nucleolar formation and possible associations with pathology.

Citations: 0
omicsMIC: a comprehensive benchmarking platform for robust comparison of imputation methods in mass spectrometry-based omics data.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-06-14 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae071
Weiqiang Lin, Jiadong Ji, Kuan-Jui Su, Chuan Qiu, Qing Tian, Lan-Juan Zhao, Zhe Luo, Chong Wu, Hui Shen, Hongwen Deng

Mass spectrometry is a powerful and widely used tool for generating proteomics, lipidomics and metabolomics profiles, which is pivotal for elucidating biological processes and identifying biomarkers. However, missing values in mass spectrometry-based omics data may pose a critical challenge for the comprehensive identification of biomarkers and elucidation of the biological processes underlying human complex disorders. To alleviate this issue, various imputation methods for mass spectrometry-based omics data have been developed. However, a comprehensive comparison of these imputation methods is still lacking, and researchers are frequently confronted with a multitude of options without a clear rationale for method selection. To address this pressing need, we developed omicsMIC (mass spectrometry-based omics with Missing values Imputation methods Comparison platform), an interactive platform that provides researchers with a versatile framework to evaluate the performance of 28 diverse imputation methods. omicsMIC offers a nuanced perspective, acknowledging the inherent heterogeneity in biological data and the unique attributes of each dataset. Our platform empowers researchers to make data-driven decisions in imputation method selection based on real-time visualizations of the outcomes associated with different imputation strategies. The comprehensive benchmarking and versatility of omicsMIC make it a valuable tool for the scientific community engaged in mass spectrometry-based omics research. omicsMIC is freely available at https://github.com/WQLin8/omicsMIC.
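
The kind of comparison this abstract describes can be sketched as a mask-and-score loop: hide a fraction of known values, impute them with each method, and measure the error on the hidden entries. The sketch below is only an illustration under stated assumptions. It does not use omicsMIC or reproduce any of its 28 methods; the three column-wise imputers, the missing-completely-at-random mask, the RMSE metric, and all names are hypothetical choices made to keep the example within the standard library.

```python
import math
import random
from statistics import mean, median


def mask_values(matrix, frac, rng):
    """Hide a random fraction of entries (missing completely at random)."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = set(rng.sample(cells, int(len(cells) * frac)))
    masked = [[None if (i, j) in hidden else value for j, value in enumerate(row)]
              for i, row in enumerate(matrix)]
    return masked, hidden


def impute_columns(masked, fill):
    """Replace each missing entry with fill(observed values of its column).

    Assumes every column keeps at least one observed value.
    """
    fills = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        fills.append(fill(observed))
    return [[fills[j] if value is None else value for j, value in enumerate(row)]
            for row in masked]


def rmse(truth, imputed, hidden):
    """Root-mean-square error over the entries that were hidden."""
    return math.sqrt(mean((truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden))


# Illustrative imputers only; not taken from the platform.
METHODS = {
    "mean": mean,
    "median": median,
    "half_min": lambda values: min(values) / 2,
}


def benchmark(matrix, frac=0.2, seed=0):
    """Return {method name: RMSE} for one random mask of `matrix`."""
    rng = random.Random(seed)
    masked, hidden = mask_values(matrix, frac, rng)
    return {name: rmse(matrix, impute_columns(masked, fill), hidden)
            for name, fill in METHODS.items()}
```

A single random mask gives a noisy ranking, so a real comparison would repeat the mask with several seeds and missingness patterns before drawing conclusions about which method suits a dataset.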

Citations: 0
GTDrift: a resource for exploring the interplay between genetic drift, genomic and transcriptomic characteristics in eukaryotes.
IF 4.6 Q1 GENETICS & HEREDITY Pub Date : 2024-06-12 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae064
Florian Bénitière, Laurent Duret, Anamaria Necsulea

We present GTDrift, a comprehensive data resource that enables explorations of genomic and transcriptomic characteristics alongside proxies of the intensity of genetic drift in individual species. This resource encompasses data for 1506 eukaryotic species, including 1413 animals and 93 green plants, and is organized in three components. The first two components contain approximations of the effective population size, which serve as indicators of the extent of random genetic drift within each species. In the first component, we meticulously investigated public databases to assemble data on life history traits such as longevity, adult body length and body mass for a set of 979 species. The second component includes estimations of the ratio between the rate of non-synonymous substitutions and the rate of synonymous substitutions (dN/dS) in protein-coding sequences for 1324 species. This ratio provides an estimate of the efficiency of natural selection in purging deleterious substitutions. Additionally, we present polymorphism-derived N_e estimates for 66 species. The third component encompasses various genomic and transcriptomic characteristics. With this component, we aim to facilitate comparative transcriptomics analyses across species, by providing easy-to-use processed data for more than 16 000 RNA-seq samples across 491 species. These data include intron-centered alternative splicing frequencies, gene expression levels and sequencing depth statistics for each species, obtained with a homogeneous analysis protocol. To enable cross-species comparisons, we provide orthology predictions for conserved single-copy genes based on BUSCO gene sets. To illustrate the possible uses of this database, we identify the most frequently used introns for each gene and we assess how the sequencing depth available for each species affects our power to identify major and minor splice variants.
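
The last use case in this abstract, picking the most frequently used intron for each gene, can be sketched in a few lines. This is an assumption-laden illustration rather than the GTDrift protocol: it takes "most frequently used" to mean the highest spliced-read count, and the record layout `(gene, intron, reads)` and the function name are hypothetical.

```python
def major_introns(records):
    """Map each gene to its intron with the highest spliced-read count.

    `records` is an iterable of (gene, intron, reads) tuples.
    On ties, the intron seen first is kept.
    """
    best = {}
    for gene, intron, reads in records:
        if gene not in best or reads > best[gene][1]:
            best[gene] = (intron, reads)
    return {gene: intron for gene, (intron, _) in best.items()}


example = [("g1", "i1", 10), ("g1", "i2", 50), ("g2", "i3", 5)]
print(major_introns(example))  # prints {'g1': 'i2', 'g2': 'i3'}
```

With low sequencing depth the read counts are small and the choice of major intron becomes unstable, which is the depth effect the abstract says the authors assess.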

{"title":"GTDrift: a resource for exploring the interplay between genetic drift, genomic and transcriptomic characteristics in eukaryotes.","authors":"Florian Bénitière, Laurent Duret, Anamaria Necsulea","doi":"10.1093/nargab/lqae064","DOIUrl":"10.1093/nargab/lqae064","url":null,"abstract":"<p><p>We present GTDrift, a comprehensive data resource that enables explorations of genomic and transcriptomic characteristics alongside proxies of the intensity of genetic drift in individual species. This resource encompasses data for 1506 eukaryotic species, including 1413 animals and 93 green plants, and is organized in three components. The first two components contain approximations of the effective population size, which serve as indicators of the extent of random genetic drift within each species. In the first component, we meticulously investigated public databases to assemble data on life history traits such as longevity, adult body length and body mass for a set of 979 species. The second component includes estimations of the ratio between the rate of non-synonymous substitutions and the rate of synonymous substitutions (d<i>N</i>/d<i>S</i>) in protein-coding sequences for 1324 species. This ratio provides an estimate of the efficiency of natural selection in purging deleterious substitutions. Additionally, we present polymorphism-derived <i>N</i> <sub>e</sub> estimates for 66 species. The third component encompasses various genomic and transcriptomic characteristics. With this component, we aim to facilitate comparative transcriptomics analyses across species, by providing easy-to-use processed data for more than 16 000 RNA-seq samples across 491 species. These data include intron-centered alternative splicing frequencies, gene expression levels and sequencing depth statistics for each species, obtained with a homogeneous analysis protocol. To enable cross-species comparisons, we provide orthology predictions for conserved single-copy genes based on BUSCO gene sets. 
To illustrate the possible uses of this database, we identify the most frequently used introns for each gene and we assess how the sequencing depth available for each species affects our power to identify major and minor splice variants.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 2","pages":"lqae064"},"PeriodicalIF":4.6,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11167491/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141311879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Structure-based learning to predict and model protein-DNA interactions and transcription-factor co-operativity in cis-regulatory elements.
IF 4.6 Q1 GENETICS & HEREDITY Pub Date : 2024-06-12 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae068
Fornes Oriol, Meseguer Alberto, Aguirre-Plans Joachim, Gohl Patrick, Bota Patricia M, Molina-Fernández Ruben, Bonet Jaume, Chinchilla-Hernandez Altair, Pegenaute Ferran, Gallego Oriol, Fernandez-Fuentes Narcis, Oliva Baldo

Transcription factor (TF) binding is a key component of genomic regulation. There are numerous high-throughput experimental methods to characterize TF-DNA binding specificities. Their application, however, is both laborious and expensive, which makes profiling all TFs challenging. For instance, the binding preferences of ∼25% of human TFs remain unknown; they have been neither determined experimentally nor inferred computationally. We introduce a structure-based learning approach to predict the binding preferences of TFs and to automate the modeling of TF regulatory complexes. We show the advantage of our approach over classical nearest-neighbor prediction at the limits of remote homology. Starting from a TF sequence or structure, we predict binding preferences in the form of motifs that are then used to scan a DNA sequence for occurrences. The best matches are either profiled with a binding score or collected for subsequent modeling into a higher-order regulatory complex with DNA. Co-operativity is modeled by: (i) the co-localization of TFs and (ii) the structural modeling of protein-protein interactions between TFs and with co-factors. We have applied our approach to automatically model the interferon-β enhanceosome and the pioneering complexes of OCT4, SOX2 (or SOX11) and KLF4 with a nucleosome, which are compared with the experimentally known structures.
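
The motif-scanning step can be sketched generically as follows. This is a minimal illustration of scanning a sequence with a log-odds position weight matrix, not the authors' structure-based scoring: the 3-bp motif, the uniform background of 0.25 and the function names `log_odds_pwm` and `scan` are assumptions for the example. It scores the forward strand only and expects uppercase A, C, G, T.

```python
import math

BASES = "ACGT"


def log_odds_pwm(prob_columns, background=0.25):
    """Convert per-position base probabilities into log2-odds scores."""
    return [
        {base: math.log2(column[base] / background) for base in BASES}
        for column in prob_columns
    ]


def scan(sequence, pwm):
    """Score every window of len(pwm) in the sequence; return (start, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


# Hypothetical motif that prefers "TGA".
motif = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
pwm = log_odds_pwm(motif)
hits = scan("CCTGACC", pwm)
best_position, best_score = max(hits, key=lambda hit: hit[1])
```

In this toy input the highest-scoring window starts at index 2, where "TGA" occurs. A real scan would also score the reverse complement and apply a score threshold before collecting matches.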

Critical cis-parameters influence STructure assisted RNA translation (START) initiation on non-AUG codons in eukaryotes.
IF 4.6 Q1 GENETICS & HEREDITY Pub Date : 2024-06-11 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae065
Antonin Tidu, Fatima Alghoul, Laurence Despons, Gilbert Eriani, Franck Martin

In eukaryotes, translation initiation is a highly regulated process, which combines cis-regulatory sequences located on the messenger RNA with trans-acting factors such as eukaryotic initiation factors (eIF). One critical step of translation initiation is start codon recognition by the scanning 43S particle, which leads to ribosome assembly and protein synthesis. In this study, we investigated the involvement of secondary structures downstream of the initiation codon in the so-called START (STructure-Assisted RNA translation) mechanism in AUG and non-AUG translation initiation. The results demonstrate that downstream secondary structures can efficiently promote non-AUG translation initiation if they are sufficiently stable to stall a scanning 43S particle and if they are located at an optimal distance from non-AUG codons to stabilize the codon-anticodon base pairing in the P site. The required stability of the downstream structure for efficient translation initiation varies between cell types. We extended this study to a genome-wide analysis of functionally characterized alternative translation initiation sites in Homo sapiens. This analysis revealed that about 25% of these sites have an optimally located downstream secondary structure of adequate stability which could elicit START, regardless of the start codon. We validated the impact of these structures on translation initiation for several selected uORFs.

Improved selection of canonical proteins for reference proteomes.
IF 4.6 Q1 GENETICS & HEREDITY Pub Date : 2024-06-08 eCollection Date: 2024-06-01 DOI: 10.1093/nargab/lqae066
Giuseppe Insana, Maria J Martin, William R Pearson

The 'canonical' protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7804 canonical changes for release 2023_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82% of ortho2tree proposed-changes agreed with MANE; for confirmed canonicals, 92% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60% identical, a group that includes vertebrates and higher plants.
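
The length inconsistency described above can be reproduced with a toy example. The sketch below contrasts the longest-isoform rule with a deliberately simplified alternative that picks one isoform per species so that lengths agree as closely as possible. It is not the ortho2tree algorithm, which works from multiple alignments and gap-distance trees; the accessions, lengths and function names are hypothetical, and the brute-force search is only practical for small groups.

```python
from itertools import product


def longest_canonical(isoform_lengths):
    """Longest-isoform rule: return the accession with the greatest length."""
    return max(isoform_lengths, key=isoform_lengths.get)


def length_consistent_choice(groups):
    """Pick one isoform per species, minimizing max - min length across species.

    groups: {species: {accession: length}} (hypothetical layout).
    """
    species = sorted(groups)
    best_combo, best_spread = None, None
    for combo in product(*(groups[name].items() for name in species)):
        lengths = [length for _, length in combo]
        spread = max(lengths) - min(lengths)
        if best_spread is None or spread < best_spread:
            best_combo, best_spread = combo, spread
    return {name: acc for name, (acc, _) in zip(species, best_combo)}


# Made-up orthologous group: accession -> protein length.
groups = {
    "human": {"H1": 512, "H2": 498},
    "mouse": {"M1": 500, "M2": 731},
    "rat": {"R1": 499},
}
by_longest = {name: longest_canonical(iso) for name, iso in groups.items()}
by_consistency = length_consistent_choice(groups)
```

Here the longest-isoform rule selects the 731-residue mouse isoform alongside human and rat proteins of about 500 residues, while the length-consistent choice selects isoforms of 498, 500 and 499 residues.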
