
bioRxiv - Bioinformatics: Latest Articles

Revisiting the taxonomy of Enterococcus casseliflavus and related species
Pub Date : 2024-09-17 DOI: 10.1101/2024.09.16.613146
Matheus Miguel Soares de Medeiros Lima, Janira Prichula, Tetsu Sakamoto
Enterococcus casseliflavus, a commonly motile, yellow-pigmented bacterium, is a commensal member of the gastrointestinal tract. It is occasionally found in cases of bacteremia and other human infections. A concern is that all strains of this species carry the vanC gene cluster on their chromosome, which confers resistance to vancomycin. The classification of E. casseliflavus is challenging because it shares 99% identity in 16S analysis with E. gallinarum and, especially, with E. flavescens, with which it is often classified as a single species. This study aimed to revisit the taxonomy of E. casseliflavus and related species through a comprehensive analysis of the genomic data available for these species in public databases. To this end, 155 genomes of E. casseliflavus-related species (E. casseliflavus, E. flavescens, E. entomosocium, and E. innesii) were retrieved and subjected to Average Nucleotide Identity (ANI) and phylogenomic analyses. Both approaches showed three well-delineated clusters, which correspond to three Enterococcus species (E. casseliflavus, E. flavescens, and E. innesii). Here we suggest (1) removing the synonym status between E. flavescens and E. casseliflavus, and (2) adding synonym status between E. entomosocium and E. casseliflavus.
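To illustrate how pairwise ANI values can delineate species-level clusters, here is a minimal sketch. It is not the authors' pipeline: it assumes the commonly used ~95% ANI species threshold (the abstract does not state a cutoff), uses simple single-linkage grouping, and runs on a made-up ANI table with hypothetical genome names `g1` to `g5`.

```python
from itertools import combinations


def cluster_by_ani(names, ani, threshold=95.0):
    """Single-linkage grouping: two genomes share a cluster when their
    pairwise ANI is >= threshold (95% is an assumed, conventional cutoff)."""
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if ani[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical ANI values (percent), invented purely for illustration.
names = ["g1", "g2", "g3", "g4", "g5"]
ani = {
    frozenset(("g1", "g2")): 98.7,
    frozenset(("g1", "g3")): 91.2,
    frozenset(("g1", "g4")): 90.8,
    frozenset(("g1", "g5")): 88.0,
    frozenset(("g2", "g3")): 91.0,
    frozenset(("g2", "g4")): 90.9,
    frozenset(("g2", "g5")): 88.3,
    frozenset(("g3", "g4")): 97.5,
    frozenset(("g3", "g5")): 87.1,
    frozenset(("g4", "g5")): 87.4,
}

print(cluster_by_ani(names, ani))  # [['g1', 'g2'], ['g3', 'g4'], ['g5']]
```

In practice the ANI table would come from a dedicated ANI tool, and the resulting groups would be checked against the phylogenomic tree, as the abstract describes.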
Citations: 0
haCCA: Multi-module Integrating of spatial transcriptomes and metabolomes.
Pub Date : 2024-09-17 DOI: 10.1101/2024.08.20.608773
Xiaotian Shen, Xiaoyun Zhang
Spatial techniques such as spatial transcriptomics and MALDI-MSI offer insights into both the transcripts and the metabolites of tissue sections. However, integrating them with high accuracy is challenging because they share no spots or features. We present haCCA, a workflow designed to integrate spatial transcriptome and metabolome data using highly correlated feature pairs and modified spatial morphological alignment. This approach ensures high-resolution and accurate spot-to-spot data integration across neighboring tissue sections. We applied haCCA to publicly available 10X Visium and MALDI-MSI datasets from mouse brain tissue and to a custom spatial transcriptome and MALDI-MSI dataset from an intrahepatic cholangiocarcinoma (ICC) model, exploring the metabolic alterations that NETs (neutrophil extracellular traps) induce in ICC and finding a potential mechanism in which NETs upregulate Scd1 to activate fatty acid metabolism. This provides new insights into the dynamic crosstalk between genes and metabolites that regulates tumor biological behavior and drives the response to treatment. We developed and published an easy-to-use Python package to facilitate its use.
Citations: 0
The Precise Basecalling of Short-Read Nanopore Sequencing
Pub Date : 2024-09-17 DOI: 10.1101/2024.09.12.612746
Ziyuan Wang, Mei-Juan Tu, Chengcheng Song, Ziyang Liu, Katherine K Wang, Shuibing Chen, Ai-Ming Yu, HONGXU DING
Nanopore sequencing of short sequences, which are typically shorter than 0.3 kb and therefore comparable in length to reads from Illumina sequencing techniques, has recently gained wide attention. Here, we design a scheme for training nanopore basecallers that are specialized for short biomolecules. Using bioengineered RNA (BioRNA) molecules as examples, we demonstrate the superior accuracy of basecallers trained with our scheme.
Citations: 0
PangeBlocks: customized construction of pangenome graphs via maximal blocks
Pub Date : 2024-09-17 DOI: 10.1101/2024.09.17.613426
Paola Bonizzoni, Jorge Eduardo Avila Cartes, Simone Ciccolella, Gianluca Della Vedova, Luca Denti
Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with heuristics, without assuming any explicit optimization criterion. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analyses, such as read mapping and variant calling.
Methods: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph.
Results: We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.
Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
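For orientation, a weighted exact cover over blocks can be written as a generic ILP. The notation below is an assumption for illustration only, and the paper's actual formulation may differ: $\mathcal{B}$ is the set of candidate blocks, $w_b$ is the weight of block $b$ under the chosen objective, $x_b$ indicates whether $b$ is selected, and $(r, c)$ ranges over the cells (row, column) of the MSA.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{b \in \mathcal{B}} w_b \, x_b \\
\text{s.t.} \quad & \sum_{b \in \mathcal{B} \,:\, (r,c) \in b} x_b = 1
  && \text{for every cell } (r,c) \text{ of the MSA}, \\
& x_b \in \{0, 1\} && \text{for every } b \in \mathcal{B}.
\end{aligned}
```

Under this sketch, setting $w_b = 1$ for all blocks would minimize the number of selected blocks; other weight choices correspond to other objective functions.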
Citations: 0
easybio: an R Package for Single-Cell Annotation with CellMarker2.0
Pub Date : 2024-09-16 DOI: 10.1101/2024.09.14.609619
Cui Wei
Single-cell RNA sequencing (scRNA-seq) allows researchers to study biological activities at the cellular level, enabling the discovery of new cell types and the analysis of intercellular interactions. However, annotating cell types in scRNA-seq data is a crucial and time-consuming process, with its quality significantly influencing downstream analyses. Accurate identification of potential cell types provides valuable insights for discovering new cell populations or identifying novel markers for known cells, which may be utilized in future research. While various methods exist for single-cell annotation, one of the most common approaches is to use known cell markers. The CellMarker2.0 database, a human-curated repository of cell markers extracted from published articles, is widely used for this purpose. However, it currently offers only a web-based tool for usage, which can be inconvenient when integrating with workflows like Seurat. To address this limitation, we introduce easybio, an R package designed to streamline single-cell annotation using the CellMarker2.0 database in conjunction with Seurat. easybio provides a suite of functions for querying the CellMarker2.0 database locally, offering insights into potential cell types for each cluster. In addition to single-cell annotation, the package also supports various bioinformatics workflows, including RNA-seq analysis, making it a versatile tool for transcriptomic research.
Citations: 0
Assessing differential cell composition in single-cell studies using voomCLR
Pub Date : 2024-09-16 DOI: 10.1101/2024.09.12.612645
Alemu Takele Assefa, Bie Verbist, Koen Van den Berge
In single-cell studies, a common question is whether there is a change in cell composition between conditions. While ideally, one needs absolute cell counts (number of cells per volumetric unit in a sample) to address these questions, current experimentation typically obtains cell counts that only carry relative information. It is therefore crucial to account for the compositional nature of cell count data in the statistical analysis. While recently developed methods address compositionality using compositional transformations together with a bias correction, they do not account for the uncertainty involved in estimating the bias term, nor do they accommodate the mean-variance structure of the counts. Here, we introduce voomCLR, a statistical method for assessing differences in cell composition between conditions that both incorporates the uncertainty in the bias term and acknowledges the mean-variance structure of the transformed data, by leveraging developments from the differential gene expression literature. We demonstrate the performance of voomCLR, illustrate the benefit of each component, and compare the methodology to the state of the art on simulated and real single-cell gene expression datasets.
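The compositional transformation suggested by the method's name, the centered log-ratio (CLR), can be sketched as follows. This is only the textbook transform, not voomCLR's implementation; the pseudocount of 0.5 used to avoid taking the log of zero is an assumption, and the cell counts are invented.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each count minus the mean log count.

    The pseudocount (an assumed value) keeps zero counts finite.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical cell-type counts for one sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)

# CLR values always sum to zero (up to floating-point error), which reflects
# that only relative abundances are retained.
print(abs(sum(transformed)) < 1e-9)  # True
```

Without a pseudocount, multiplying every count in a sample by the same factor leaves the CLR values unchanged, which is why the transform suits counts that carry only relative information.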
Citations: 0
Optimal transport reveals dynamic gene regulatory networks via gene velocity estimation
Pub Date : 2024-09-16 DOI: 10.1101/2024.09.12.612590
Wenjun Zhao, Erica Larschan, Bjorn Sandstede, Ritambhara Singh
Inferring gene regulatory networks from gene expression data is an important and challenging problem in the biology community. We propose OTVelo, a methodology that takes time-stamped single-cell gene expression data as input and predicts gene regulation across two time points. It is known that the rate of change of gene expression, which we will refer to as gene velocity, provides crucial information that enhances such inference; however, this information is not always available due to the limitations in sequencing depth. Our algorithm overcomes this limitation by estimating gene velocities using optimal transport. We then infer gene regulation using time-lagged correlation and Granger causality via regularized linear regression. Instead of providing an aggregated network across all time points, our method uncovers the underlying dynamical mechanism across time points. We validate our algorithm on 13 simulated datasets with both synthetic and curated networks and demonstrate its efficacy on 4 experimental data sets.
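The time-lagged correlation idea mentioned above can be illustrated with a small sketch. It is not OTVelo itself: the method described in the abstract correlates gene velocities estimated via optimal transport, whereas this example simply correlates two made-up series, pairing the candidate regulator at time t with the target at time t + lag. The function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged_correlation(regulator, target, lag=1):
    """Correlate regulator[t] with target[t + lag] for a positive integer lag."""
    if lag < 1 or lag >= len(regulator):
        raise ValueError("lag must be between 1 and len(regulator) - 1")
    return pearson(regulator[: len(regulator) - lag], target[lag:])


# Toy series: the target repeats the regulator one time step later.
regulator = [0.0, 1.0, 0.0, 2.0, 1.0, 3.0]
target = [5.0, 0.0, 1.0, 0.0, 2.0, 1.0]

print(round(lagged_correlation(regulator, target, lag=1), 6))  # 1.0
```

A high correlation at a positive lag is consistent with, but does not prove, regulation of the target by the regulator, which is presumably why the abstract pairs it with Granger causality.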
Citations: 0
gsQTL: Associating genetic risk variants with gene sets by exploiting their shared variability
Pub Date : 2024-09-16 DOI: 10.1101/2024.09.13.612853
Gerard A Bouland, Niccolo Tesi, Ahmed Mahfouz, Marcel Reinders
To investigate the functional significance of genetic risk loci identified through genome-wide association studies (GWASs), genetic loci are linked to genes based on their capacity to account for variation in gene expression, resulting in expression quantitative trait loci (eQTL). Following this, gene set analyses are commonly used to gain insights into functionality. However, the efficacy of this approach is hampered by small effect sizes and the burden of multiple testing. We propose an alternative approach: instead of examining the cumulative associations of individual genes within a gene set, we consider the collective variation of the entire gene set. We introduce the concept of gene set QTL (gsQTL) and show that it is more adept at identifying links between genetic risk variants and specific gene sets. Notably, gsQTL is less susceptible to inflation or deflation of significant enrichments than conventional methods. Furthermore, we demonstrate the broader applicability of shared variability within gene sets, for example in the coordinated regulation of genes by a transcription factor or in coordinated differential expression.
Citations: 0
DeOri 10.0: An Updated Database of Experimentally Identified Eukaryotic Replication Origins
Pub Date : 2024-09-16 DOI: 10.1101/2024.09.12.612581
Yu-Hao Zeng, Zhen-Ning Yin, Hao Luo, Feng Gao
DNA replication is a complex and crucial biological process in eukaryotes. To facilitate the study of eukaryotic replication events, we present the Database of Eukaryotic DNA Replication Origins (DeOri), which collects scattered data and integrates extensive sequencing data on eukaryotic DNA replication origins. With the continuous updating of DeOri, the number of datasets in the new release has increased from 10 to 151 and the number of sequences from 16,145 to 9,742,396. Besides nucleotide sequences and BED files, corresponding annotation files covering coding sequences (CDS), mRNA, and other biological elements within replication origins are also provided. The experimental techniques used for each dataset, as well as other statistical data, are presented on the web page. Differences in experimental methods, cell lines, and sequencing technologies have resulted in distinct sets of replication origins, making it challenging to differentiate between cell-specific and non-specific replication origins. We combined multiple replication origin datasets at the species level, then scored and screened them. The screened regions were considered species-conserved origins and are integrated and presented as reference replication origins (rORIs) for species including Homo sapiens, Gallus gallus, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans. Additionally, we analyzed the genome-level distribution of genomic elements associated with replication origins, such as CpG islands (CGIs), transcription start sites (TSSs), and G-quadruplexes (G4s). These analysis results allow users to select the data they need. DeOri is available at http://tubic.tju.edu.cn/deori10/.
Citations: 0
Deep Learning Predicts Non-Normal Peptide FAIMS Mobility Distributions Directly from Sequence
Pub Date : 2024-09-16 DOI: 10.1101/2024.09.11.612538
Justin McKetney, Ian J Miller, Alexandre Hutton, Pavel Sinitcyn, Joshua J Coon, Jesse G Meyer
Peptide ion mobility adds an extra dimension of separation to mass spectrometry-based proteomics. The ability to accurately predict peptide ion mobility would be useful to expedite assay development and to discriminate true answers in database search. There are methods to accurately predict peptide ion mobility through drift tube devices, but methods to predict mobility through high-field asymmetric waveform ion mobility spectrometry (FAIMS) are underexplored. Here, we successfully model peptide ions' FAIMS mobility using a multi-label multi-output classification scheme to account for non-normal transmission distributions. We trained two models on over 100,000 human peptide precursors: a random forest and a long short-term memory (LSTM) neural network. The two models had different strengths, and the ensemble average of their predictions produced a higher F2 score than either model alone. Finally, we explore cases where the models make mistakes and demonstrate predictive performance of F2=0.66 (AUROC=0.928) on a new test dataset of nearly 40,000 different E. coli peptide ions. The deep learning model is easily accessible via https://faims.xods.org.
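As a reminder of what the reported metric measures, the F2 score is the F-beta score with beta = 2, which weights recall more heavily than precision: F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP). The sketch below computes it on a flat list of binary labels and shows a simple ensemble average of two models' predicted probabilities. The probabilities, the 0.5 cutoff, and the idea that each label stands for one FAIMS transmission bin are illustrative assumptions, not values or details taken from the paper.

```python
from typing import List, Sequence


def fbeta(y_true: Sequence[int], y_pred: Sequence[int], beta: float = 2.0) -> float:
    """F-beta score for binary labels; beta=2 gives the F2 score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def ensemble_average(prob_a: Sequence[float], prob_b: Sequence[float]) -> List[float]:
    """Element-wise mean of two models' predicted probabilities."""
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]


def binarize(probs: Sequence[float], cutoff: float = 0.5) -> List[int]:
    """Turn probabilities into 0/1 labels (cutoff is an assumed value)."""
    return [1 if p >= cutoff else 0 for p in probs]


# Made-up example: five labels, e.g. five hypothetical FAIMS bins for one peptide.
y_true = [1, 1, 0, 0, 1]
rf_probs = [0.9, 0.4, 0.2, 0.6, 0.3]    # hypothetical random forest output
lstm_probs = [0.7, 0.8, 0.1, 0.2, 0.5]  # hypothetical LSTM output

f2_rf = fbeta(y_true, binarize(rf_probs))
f2_lstm = fbeta(y_true, binarize(lstm_probs))
f2_ensemble = fbeta(y_true, binarize(ensemble_average(rf_probs, lstm_probs)))
```

In this toy case the averaged prediction removes the random forest's false positive while keeping two of the three true positives. It only illustrates how averaging can change the thresholded labels and says nothing about the models in the paper.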
Citations: 0