NAR Genomics and Bioinformatics: Latest Articles
Evaluation of machine learning models that predict lncRNA subcellular localization.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-09-18 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae125
Jason R Miller, Weijun Yi, Donald A Adjeroh

The lncATLAS database quantifies the relative cytoplasmic versus nuclear abundance of long non-coding RNAs (lncRNAs) observed in 15 human cell lines. The literature describes several machine learning models trained and evaluated on these and similar datasets. These reports showed moderate performance, e.g. 72-74% accuracy, on test subsets of the data withheld from training. In all these reports, the datasets were filtered to include genes with extreme values while excluding genes with values in the middle range and the filters were applied prior to partitioning the data into training and testing subsets. Using several models and lncATLAS data, we show that this 'middle exclusion' protocol boosts performance metrics without boosting model performance on unfiltered test data. We show that various models achieve only about 60% accuracy when evaluated on unfiltered lncRNA data. We suggest that the problem of predicting lncRNA subcellular localization from nucleotide sequences is more challenging than currently perceived. We provide a basic model and evaluation procedure as a benchmark for future studies of this problem.
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae125. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409063/pdf/
Citations: 0
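The "middle exclusion" effect described in the lncRNA localization abstract above can be illustrated with a small Python sketch on synthetic data. This is not the authors' model and it does not use lncATLAS: the uniform localization score, the Gaussian noise, the sign-based classifier and the 0.5 cutoff are all assumptions chosen only to show why accuracy measured on a filtered test set can exceed accuracy on unfiltered data.

```python
import random


def simulate(n_genes=4000, noise_sd=1.0, seed=0):
    """Return (value, feature) pairs for hypothetical genes.

    value   : assumed localization score in [-1, 1] (sign = class label)
    feature : a noisy proxy for that score, standing in for a model output
    """
    rng = random.Random(seed)
    genes = []
    for _ in range(n_genes):
        value = rng.uniform(-1.0, 1.0)
        feature = value + rng.gauss(0.0, noise_sd)
        genes.append((value, feature))
    return genes


def accuracy(genes):
    """Accuracy of a toy classifier that predicts the class from sign(feature)."""
    correct = sum((feature > 0) == (value > 0) for value, feature in genes)
    return correct / len(genes)


def middle_excluded(genes, cutoff=0.5):
    """Keep only genes with extreme scores, dropping the middle range."""
    return [(value, feature) for value, feature in genes if abs(value) > cutoff]


genes = simulate()
acc_unfiltered = accuracy(genes)
acc_filtered = accuracy(middle_excluded(genes))
print(f"unfiltered: {acc_unfiltered:.3f}  middle-excluded: {acc_filtered:.3f}")
```

The classifier is identical in both evaluations; only the test set changes. Genes near the class boundary are the hardest to call, so removing them raises the reported metric without making the classifier any better.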
GeneSPIDER2: large scale GRN simulation and benchmarking with perturbed single-cell data.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-09-18 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae121
Mateusz Garbulowski, Thomas Hillerton, Daniel Morgan, Deniz Seçilmiş, Lisbet Sonnhammer, Andreas Tjärnberg, Torbjörn E M Nordling, Erik L L Sonnhammer

Single-cell data is increasingly used for gene regulatory network (GRN) inference, and benchmarks for this have been developed based on simulated data. However, existing single-cell simulators cannot model the effects of gene perturbations. A further challenge lies in generating large-scale GRNs that often struggle with computational and stability issues. We present GeneSPIDER2, an update of the GeneSPIDER MATLAB toolbox for GRN benchmarking, inference, and analysis. Several software modules have improved capabilities and performance, and new functionalities have been added. A major improvement is the ability to generate large GRNs with biologically realistic topological properties in terms of scale-free degree distribution and modularity. Another major addition is a simulation of single-cell data, which is becoming increasingly popular as input for GRN inference. Specifically, we introduced the unique feature to generate single-cell data based on genetic perturbations. Finally, the simulated single-cell data was compared to real single-cell Perturb-seq data from two cell lines, showing that the synthetic and real data exhibit similar properties.
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae121. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409065/pdf/
Citations: 0
Transformer model generated bacteriophage genomes are compositionally distinct from natural sequences.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-09-18 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae129
Jeremy Ratcliff

Novel applications of language models in genomics promise to have a large impact on the field. The megaDNA model is the first publicly available generative model for creating synthetic viral genomes. To evaluate megaDNA's ability to recapitulate the nonrandom genome composition of viruses and assess whether synthetic genomes can be algorithmically detected, compositional metrics for 4969 natural bacteriophage genomes and 1002 de novo synthetic bacteriophage genomes were compared. Transformer-generated sequences had varied but realistic genome lengths, and 58% were classified as viral by geNomad. However, the sequences demonstrated consistent differences in various compositional metrics when compared to natural bacteriophage genomes by rank-sum tests and principal component analyses. A simple neural network trained to detect transformer-generated sequences on global compositional metrics alone displayed a median sensitivity of 93.0% and specificity of 97.9% (n = 12 independent models). Overall, these results demonstrate that megaDNA does not yet generate bacteriophage genomes with realistic compositional biases and that genome composition is a reliable method for detecting sequences generated by this model. While the results are specific to the megaDNA model, the evaluated framework described here could be applied to any generative model for genomic sequences.
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae129. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409064/pdf/
Citations: 0
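The bacteriophage abstract above compares "compositional metrics" between natural and synthetic genomes without listing them. As a hedged illustration, the Python sketch below computes two widely used global metrics, GC content and a dinucleotide observed/expected ratio. Whether the paper used these exact metrics is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def dinucleotide_odds_ratio(seq: str, dinuc: str) -> float:
    """Observed frequency of `dinuc` divided by the product of its base frequencies.

    Values below 1 suggest the dinucleotide is under-represented relative to
    what the single-base composition alone would predict.
    """
    seq = seq.upper()
    dinuc = dinuc.upper()
    if len(seq) < 2:
        return 0.0
    base_counts = Counter(seq)
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    f_pair = pair_counts[dinuc] / (len(seq) - 1)
    f_first = base_counts[dinuc[0]] / len(seq)
    f_second = base_counts[dinuc[1]] / len(seq)
    if f_first == 0 or f_second == 0:
        return 0.0
    return f_pair / (f_first * f_second)


example = "ACGT"
print(gc_content(example), dinucleotide_odds_ratio(example, "CG"))
```

A vector of such values per genome could then feed a rank-sum test, a principal component analysis or a small classifier, as the abstract describes.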
Correction to 'long non-coding RNAs involved in Drosophila development and regeneration'.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-09-14 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae127

[This corrects the article DOI: 10.1093/nargab/lqae091.].
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae127. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11400925/pdf/
Citations: 0
FlaHMM: unistrand flamenco-like piRNA cluster prediction in Drosophila species using hidden Markov models.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-09-14 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae119
Maria-Anna Trapotsi, Jasper van Lopik, Gregory J Hannon, Benjamin Czech Nicholson, Susanne Bornelöv

PIWI-interacting RNAs (piRNAs) are a class of small non-coding RNAs that are essential for transposon control in animal gonads. In Drosophila ovarian somatic cells, piRNAs are transcribed from large genomic regions called piRNA clusters, which are enriched for transposon fragments and act as a memory of past invasions. Despite being widely present across Drosophila species, somatic piRNA clusters are difficult to identify and study due to their lack of sequence conservation and limited synteny. Current identification methods rely on either extensive manual curation or availability of high-throughput small RNA sequencing data, limiting large-scale comparative studies. We now present FlaHMM, a hidden Markov model developed to automate genomic annotation of flamenco-like unistrand piRNA clusters in Drosophila species, requiring only a genome assembly and transposon annotations. FlaHMM uses transposable element content across 5- or 10-kb bins, which can be calculated from genome sequence alone, and is thus able to detect candidate piRNA clusters without the need to obtain flies and experimentally perform small RNA sequencing. We show that FlaHMM performs on par with piRNA-guided or manual methods, and thus provides a scalable and efficient approach to piRNA cluster annotation in new genome assemblies. FlaHMM is freely available at https://github.com/Hannon-lab/FlaHMM under an MIT licence.
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae119. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11400887/pdf/
Citations: 0
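The FlaHMM abstract above describes a hidden Markov model over transposable element content in fixed-size genomic bins. The Python sketch below shows the general idea with a two-state HMM decoded by the Viterbi algorithm. It is not the FlaHMM implementation: the state names, the low/high discretization, the 0.5 cutoff and every probability are hypothetical values picked for illustration.

```python
import math

# All parameters below are illustrative assumptions, not FlaHMM's values.
STATES = ("background", "cluster")
START = {"background": 0.9, "cluster": 0.1}
TRANS = {
    "background": {"background": 0.9, "cluster": 0.1},
    "cluster": {"background": 0.1, "cluster": 0.9},
}
EMIT = {
    "background": {"low": 0.8, "high": 0.2},
    "cluster": {"low": 0.2, "high": 0.8},
}


def discretize(te_fraction, cutoff=0.5):
    """Map a bin's transposable element fraction to a 'low' or 'high' symbol."""
    return "high" if te_fraction >= cutoff else "low"


def viterbi(observations):
    """Return the most probable state path for a list of 'low'/'high' symbols."""
    if not observations:
        return []
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this bin.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


te_fractions = [0.1, 0.2, 0.9, 0.8, 0.9, 0.7, 0.9, 0.1, 0.2]
path = viterbi([discretize(x) for x in te_fractions])
print(path)
```

Because switching states is penalized, isolated high-content bins tend to stay labeled as background, while a sustained run of high-content bins is called as a candidate cluster.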
Correction to 'Clusters of mammalian conserved RNA structures in UTRs associate with RBP binding sites'.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-09-03 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae120

[This corrects the article DOI: 10.1093/nar/lqae089.].
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae120. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11369695/pdf/
Citations: 0
Machine learning of metabolite-protein interactions from model-derived metabolic phenotypes.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-09-03 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae114
Mahdis Habibpour, Zahra Razaghi-Moghadam, Zoran Nikoloski

Unraveling metabolite-protein interactions is key to identifying the mechanisms by which metabolism affects the function of other cellular layers. Despite extensive experimental and computational efforts to identify the regulatory roles of metabolites in interaction with proteins, it remains challenging to achieve a genome-scale coverage of these interactions. Here, we leverage established gold standards for metabolite-protein interactions to train supervised classifiers using features derived from genome-scale metabolic models and matched data on protein abundance and reaction fluxes to distinguish interacting from non-interacting pairs. Through a comprehensive comparative study, we explore the impact of different features and assess the effect of gold standards for non-interacting pairs on the performance of the classifiers. Using data sets from Escherichia coli and Saccharomyces cerevisiae, we demonstrate that the features constructed by integrating fluxomic and proteomic data with metabolic phenotypes predicted from genome-scale metabolic models can be effectively used to train classifiers, accurately predicting metabolite-protein interactions in the context of metabolism. Our results reveal that the high performance of classifiers trained on these features is unaffected by the method used to generate gold standards for non-interacting pairs. Overall, our study introduces valuable features that improve the performance of identifying metabolite-protein interactions in the context of metabolism.
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae114. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11369697/pdf/
Citations: 0
DANTE and DANTE_LTR: lineage-centric annotation pipelines for long terminal repeat retrotransposons in plant genomes.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-08-29 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae113
Petr Novák, Nina Hoštáková, Pavel Neumann, Jiří Macas

Long terminal repeat (LTR) retrotransposons constitute a predominant class of repetitive DNA elements in most plant genomes. With the increasing number of sequenced plant genomes, there is an ongoing demand for computational tools facilitating efficient annotation and classification of LTR retrotransposons in plant genome assemblies. Herein, we introduce DANTE, a computational pipeline for Domain-based ANnotation of Transposable Elements, designed for sensitive detection of these elements via their conserved protein domain sequences. The identified protein domains are subsequently inputted into the DANTE_LTR pipeline to annotate complete element sequences by detecting their structural features, such as LTRs, in adjacent genomic regions. Leveraging domain sequences allows for precise classification of elements into phylogenetic lineages, offering a more granular annotation compared with coarser conventional superfamily-based classification methods. The efficiency and accuracy of this approach were evidenced via annotation of LTR retrotransposons in 93 plant genomes. Results were benchmarked against several established pipelines, showing that DANTE_LTR is capable of identifying significantly more intact LTR retrotransposons. DANTE and DANTE_LTR are provided as user-friendly Galaxy tools accessible via a public server (https://repeatexplorer-elixir.cerit-sc.cz), installable on local Galaxy instances from the Galaxy tool shed or executable from the command line.
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae113. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358816/pdf/
Citations: 0
SpikeFlow: automated and flexible analysis of ChIP-Seq data with spike-in control.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-08-29 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae118
Davide Bressan, Daniel Fernández-Pérez, Alessandro Romanel, Fulvio Chiacchiera

ChIP with reference exogenous genome (ChIP-Rx) is widely used to study histone modification changes across different biological conditions. A key step in the bioinformatics analysis of this data is calculating the normalization factors, which vary from the standard ChIP-seq pipelines. Choosing and applying the appropriate normalization method is crucial for interpreting the biological results. However, a comprehensive pipeline for complete ChIP-Rx data analysis is lacking. To address these challenges, we introduce SpikeFlow, an integrated Snakemake workflow that combines features from various existing tools to streamline ChIP-Rx data processing and enhance usability. SpikeFlow automates spike-in data scaling and provides multiple normalization options. It also performs peak calling and differential analysis with distinct modalities, enabling the detection of enrichment regions for histone modifications and transcription factor binding. Our workflow runs in-depth quality control at all the processing steps and generates an analysis report with tables and graphs to facilitate results interpretation. We validated the pipeline by performing a comparative analysis with DiffBind and SpikChIP, demonstrating robust performances in various biological models. By combining diverse functionalities into a single platform, SpikeFlow aims to simplify ChIP-Rx data analysis for the research community.
NAR Genomics and Bioinformatics, volume 6, issue 3, article lqae118. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358820/pdf/
Citations: 0
Context-adjusted proportion of singletons (CAPS): a novel metric for assessing negative selection in the human genome.
IF 4 Q1 GENETICS & HEREDITY Pub Date : 2024-08-29 eCollection Date: 2024-09-01 DOI: 10.1093/nargab/lqae111
Mikhail Gudkov, Loïc Thibaut, Eleni Giannoulatou

Interpretation of genetic variants remains challenging, partly because there are no well-established ways to determine the potential pathogenicity of genetic variation, especially for understudied classes of variants. Population genetics methods offer a practical solution by evaluating variant effects through their distribution in human populations. Negative selection influences the proportion of singleton variants, which can therefore serve as a proxy for deleteriousness, as exemplified by the Mutability-Adjusted Proportion of Singletons (MAPS) metric. However, MAPS is sensitive to the calibration of the singletons-by-mutability linear model, which results in biased estimates for certain variant classes. Building on the methodology used in MAPS, we introduce the Context-Adjusted Proportion of Singletons (CAPS) metric for assessing negative selection in the human genome. CAPS produces corrected estimates with more accurate confidence intervals by eliminating the mutability layer in the model. CAPS retains the advantageous features of MAPS and emerges as a robust and reliable tool. We believe that CAPS has the potential to enhance the identification of new disease-variant associations in clinical and research settings, offering improved accuracy in assessing negative selection for diverse SNV classes.
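
The general idea of comparing an observed proportion of singletons with a context-based expectation can be sketched as follows. This is a simplified illustration under stated assumptions, not the published CAPS definition: it assumes the expected singleton rate for each sequence context is estimated directly from a reference set of presumably neutral variants, with no intermediate mutability regression, and it omits confidence intervals entirely. All names (`Variant`, `context_singleton_rates`, `adjusted_singleton_proportion`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A variant is represented as (sequence context, is_singleton).
Variant = Tuple[str, bool]


def context_singleton_rates(reference: Iterable[Variant]) -> Dict[str, float]:
    """Estimate the proportion of singletons per context from a reference set."""
    counts = defaultdict(lambda: [0, 0])  # context -> [singletons, total]
    for context, is_singleton in reference:
        counts[context][0] += int(is_singleton)
        counts[context][1] += 1
    return {ctx: singles / total for ctx, (singles, total) in counts.items()}


def adjusted_singleton_proportion(
    variants: List[Variant], rates: Dict[str, float]
) -> float:
    """Return (observed singletons - expected singletons) / number of variants.

    Raises KeyError if a variant's context is absent from `rates`.
    """
    if not variants:
        raise ValueError("variants must not be empty")
    observed = sum(int(is_singleton) for _, is_singleton in variants)
    expected = sum(rates[context] for context, _ in variants)
    return (observed - expected) / len(variants)


if __name__ == "__main__":
    reference = [
        ("ACG", True), ("ACG", False), ("ACG", False), ("ACG", False),
        ("TTT", True), ("TTT", False),
    ]
    rates = context_singleton_rates(reference)
    query = [("ACG", True), ("ACG", True), ("TTT", False), ("TTT", True)]
    print(adjusted_singleton_proportion(query, rates))
```

In this sketch, a positive value means the variant class contains more singletons than its sequence contexts alone would predict, which is the pattern expected under stronger negative selection.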
