Pub Date: 2024-09-17; DOI: 10.1101/2024.09.16.613146
Matheus Miguel Soares de Medeiros Lima, Janira Prichula, Tetsu Sakamoto
Enterococcus casseliflavus, a commonly motile, yellow-pigmented bacterium, is a commensal member of the gastrointestinal tract. It is occasionally found in cases of bacteremia and other human infections. A concern is that all strains of this species carry the vanC gene cluster on their chromosome, which confers resistance to vancomycin. The classification of E. casseliflavus is challenging, as it shows 99% identity in 16S analysis with E. gallinarum and, above all, with E. flavescens, with which it is often classified as a single species. This study aimed to revisit the taxonomy of E. casseliflavus and related species through a comprehensive analysis of the genomic data available for these species in public databases. For this, 155 genomes of E. casseliflavus-related species (E. casseliflavus, E. flavescens, E. entomosocium, and E. innesii) were retrieved and submitted to Average Nucleotide Identity (ANI) and phylogenomic analysis. Both approaches showed three well-delineated clusters, which correspond to three Enterococcus species (E. casseliflavus, E. flavescens, and E. innesii). Here we suggest (1) removing the synonym status between E. flavescens and E. casseliflavus, and (2) adding synonym status between E. entomosocium and E. casseliflavus.
{"title":"Revisiting the taxonomy of Enterococcus casseliflavus and related species","authors":"Matheus Miguel Soares de Medeiros Lima, Janira Prichula, Tetsu Sakamoto","doi":"10.1101/2024.09.16.613146","DOIUrl":"https://doi.org/10.1101/2024.09.16.613146","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-17"}
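The ANI step described in the abstract above can be illustrated with a minimal sketch. It assumes a species boundary of about 95% ANI, a widely used convention that the abstract itself does not state, and groups genomes by single-linkage clustering. The genome labels and ANI values are hypothetical and are not results from the study.

```python
from itertools import combinations

# Assumed species cutoff (a common convention, not taken from the abstract).
ANI_SPECIES_CUTOFF = 95.0


def cluster_by_ani(genomes, ani, cutoff=ANI_SPECIES_CUTOFF):
    """Single-linkage clustering: two genomes are joined if their ANI >= cutoff."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster representative (path halving).
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        # Each ANI value is stored once per unordered pair.
        value = ani.get((a, b), ani.get((b, a), 0.0))
        if value >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical genome labels and ANI percentages, for illustration only.
genomes = ["cas_1", "cas_2", "fla_1", "inn_1"]
ani = {
    ("cas_1", "cas_2"): 98.7,
    ("cas_1", "fla_1"): 93.1,
    ("cas_2", "fla_1"): 92.8,
    ("cas_1", "inn_1"): 90.4,
    ("cas_2", "inn_1"): 90.9,
    ("fla_1", "inn_1"): 89.5,
}
clusters = cluster_by_ani(genomes, ani)
print(clusters)
```

With these toy values only the two "cas" genomes exceed the cutoff, so they form one cluster and the other two genomes remain separate. Real analyses would compute ANI with a dedicated tool and would typically cross-check the clusters against a phylogenomic tree, as the abstract describes.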
Pub Date: 2024-09-17; DOI: 10.1101/2024.08.20.608773
Xiaotian Shen, Xiaoyun Zhang
Spatial techniques such as spatial transcriptomics and MALDI-MSI offer insights into both the transcripts and the metabolites of tissue sections. However, integrating them with high accuracy is challenging because the two modalities share no spots or features. We present haCCA, a workflow designed to integrate spatial transcriptome and metabolome data using highly correlated feature pairs and modified spatial morphological alignment. This approach enables high-resolution, accurate spot-to-spot data integration across neighboring tissue sections. We applied haCCA to publicly available 10X Visium and MALDI-MSI datasets from mouse brain tissue and to a custom spatial transcriptome and MALDI-MSI dataset from an intrahepatic cholangiocarcinoma (ICC) model, exploring the metabolic alterations induced by NETs (neutrophil extracellular traps) in ICC and finding a potential mechanism in which NETs upregulate Scd1 to activate fatty acid metabolism. These results provide new insights into the dynamic crosstalk between genes and metabolites that regulates tumor biological behavior and drives the response to treatment. We developed and published an easy-to-use Python package to facilitate its use.
{"title":"haCCA: Multi-module Integrating of spatial transcriptomes and metabolomes.","authors":"Xiaotian Shen, Xiaoyun Zhang","doi":"10.1101/2024.08.20.608773","DOIUrl":"https://doi.org/10.1101/2024.08.20.608773","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-17"}
Nanopore sequencing of short sequences, whose lengths are typically less than 0.3 kb and therefore comparable with Illumina read lengths, has recently gained wide attention. Here, we design a scheme for training nanopore basecallers that are specialized for short biomolecules. With bioengineered RNA (BioRNA) molecules as examples, we demonstrate the superior accuracy of basecallers trained by our scheme.
{"title":"The Precise Basecalling of Short-Read Nanopore Sequencing","authors":"Ziyuan Wang, Mei-Juan Tu, Chengcheng Song, Ziyang Liu, Katherine K Wang, Shuibing Chen, Ai-Ming Yu, HONGXU DING","doi":"10.1101/2024.09.12.612746","DOIUrl":"https://doi.org/10.1101/2024.09.12.612746","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-17"}
Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph using heuristics, without an explicit optimization criterion. It is therefore unclear how a specific optimization criterion affects the graph topology and downstream analyses such as read mapping and variant calling. Methods: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). We then propose an Integer Linear Programming (ILP) formulation of the MWBC problem that allows us to study the most natural objective functions for building a graph. Results: We provide an implementation of the ILP approach for solving the MWBC problem and evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs with different properties, which suggests that the specific downstream task can drive the graph construction phase. Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA in which the user can guide the construction.
{"title":"PangeBlocks: customized construction of pangenome graphs via maximal blocks","authors":"Paola Bonizzoni, Jorge Eduardo Avila Cartes, Simone Ciccolella, Gianluca Della Vedova, Luca Denti","doi":"10.1101/2024.09.17.613426","DOIUrl":"https://doi.org/10.1101/2024.09.17.613426","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-17"}
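The exact cover framing in the PangeBlocks abstract above can be made concrete with a toy sketch. It swaps the ILP for a brute-force search, which is only feasible on tiny inputs, and the two weight functions (number of blocks, total label length) are assumed stand-ins for "natural objective functions", not necessarily those studied in the paper. The two-row alignment and block names are hypothetical.

```python
from itertools import combinations


def min_weight_exact_cover(cells, blocks, weight):
    """Brute-force minimum weighted exact cover.

    cells  : set of (row, column) positions of the MSA to be covered
    blocks : dict name -> (set of cells, sequence label)
    weight : function mapping a block tuple to its cost
    Returns (total weight, tuple of block names), or None if no cover exists.
    """
    best = None
    names = sorted(blocks)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            covered = set()
            n_cells = 0
            for name in subset:
                covered |= blocks[name][0]
                n_cells += len(blocks[name][0])
            # Exact cover: every cell covered, and no cell covered twice.
            if covered != cells or n_cells != len(cells):
                continue
            total = sum(weight(blocks[name]) for name in subset)
            if best is None or total < best[0]:
                best = (total, subset)
    return best


# Hypothetical MSA with two rows: row 0 = "ACG", row 1 = "ATG".
cells = {(r, c) for r in range(2) for c in range(3)}
blocks = {
    "A": ({(0, 0), (1, 0)}, "A"),             # column 0, shared by both rows
    "G": ({(0, 2), (1, 2)}, "G"),             # column 2, shared by both rows
    "C": ({(0, 1)}, "C"),                     # column 1, row 0 only
    "T": ({(1, 1)}, "T"),                     # column 1, row 1 only
    "row0": ({(0, 0), (0, 1), (0, 2)}, "ACG"),
    "row1": ({(1, 0), (1, 1), (1, 2)}, "ATG"),
}

by_count = min_weight_exact_cover(cells, blocks, lambda block: 1)
by_length = min_weight_exact_cover(cells, blocks, lambda block: len(block[1]))
print(by_count, by_length)
```

Minimizing the number of blocks selects the two full-row blocks, while minimizing the total label length selects the four short blocks that share the common columns. Even on this toy input, the choice of objective changes which graph would be built, which is the point the abstract makes.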
Pub Date: 2024-09-16; DOI: 10.1101/2024.09.14.609619
Cui Wei
Single-cell RNA sequencing (scRNA-seq) allows researchers to study biological activities at the cellular level, enabling the discovery of new cell types and the analysis of intercellular interactions. However, annotating cell types in scRNA-seq data is a crucial and time-consuming process, and its quality significantly influences downstream analyses. Accurate identification of potential cell types provides valuable insights for discovering new cell populations or identifying novel markers for known cells, which may be used in future research. While various methods exist for single-cell annotation, one of the most common approaches is to use known cell markers. The CellMarker2.0 database, a human-curated repository of cell markers extracted from published articles, is widely used for this purpose. However, it currently offers only a web-based tool, which can be inconvenient to integrate with workflows such as Seurat. To address this limitation, we introduce easybio, an R package designed to streamline single-cell annotation using the CellMarker2.0 database in conjunction with Seurat. easybio provides a suite of functions for querying the CellMarker2.0 database locally, offering insights into potential cell types for each cluster. In addition to single-cell annotation, the package also supports various bioinformatics workflows, including RNA-seq analysis, making it a versatile tool for transcriptomic research.
{"title":"easybio: an R Package for Single-Cell Annotation with CellMarker2.0","authors":"Cui Wei","doi":"10.1101/2024.09.14.609619","DOIUrl":"https://doi.org/10.1101/2024.09.14.609619","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-16"}
Pub Date: 2024-09-16; DOI: 10.1101/2024.09.12.612645
Alemu Takele Assefa, Bie Verbist, Koen Van den Berge
In single-cell studies, a common question is whether cell composition changes between conditions. Ideally, one needs absolute cell counts (the number of cells per volumetric unit in a sample) to address this question, but current experiments typically yield cell counts that carry only relative information. It is therefore crucial to account for the compositional nature of cell count data in the statistical analysis. While recently developed methods address compositionality using compositional transformations together with a bias correction, they do not account for the uncertainty involved in estimating the bias term, nor do they accommodate the mean-variance structure of the counts. Here, we introduce a statistical method, voomCLR, for assessing differences in cell composition between conditions. It incorporates the uncertainty in the bias term and acknowledges the mean-variance structure of the transformed data by leveraging developments from the differential gene expression literature. We demonstrate the performance of voomCLR, illustrate the benefit of each component, and compare the methodology to the state of the art on simulated and real single-cell gene expression datasets.
{"title":"Assessing differential cell composition in single-cell studies using voomCLR","authors":"Alemu Takele Assefa, Bie Verbist, Koen Van den Berge","doi":"10.1101/2024.09.12.612645","DOIUrl":"https://doi.org/10.1101/2024.09.12.612645","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-16"}
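The "compositional transformations" mentioned in the voomCLR abstract above can be illustrated with the centered log-ratio (CLR) transform, which the method's name suggests. The sketch below is not the voomCLR implementation: it omits the bias correction and the mean-variance modelling, and the pseudocount of 0.5 and the cell counts are assumptions chosen for illustration.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's cell-type counts.

    The pseudocount (an assumed value) avoids log(0) for empty cell types.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical counts for three cell types in one sample.
sample = [120, 30, 850]
transformed = clr(sample)

# Without a pseudocount, CLR depends only on relative abundances:
# multiplying every count by 10 leaves the transformed values unchanged.
small = clr([10, 20, 70], pseudocount=0.0)
large = clr([100, 200, 700], pseudocount=0.0)
print(transformed, small, large)
```

The transformed values sum to zero within a sample, which is why a shift in one cell type necessarily shows up in the others. That coupling is the source of the bias term the abstract refers to.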
Inferring gene regulatory networks from gene expression data is an important and challenging problem in the biology community. We propose OTVelo, a methodology that takes time-stamped single-cell gene expression data as input and predicts gene regulation across two time points. It is known that the rate of change of gene expression, which we will refer to as gene velocity, provides crucial information that enhances such inference; however, this information is not always available due to the limitations in sequencing depth. Our algorithm overcomes this limitation by estimating gene velocities using optimal transport. We then infer gene regulation using time-lagged correlation and Granger causality via regularized linear regression. Instead of providing an aggregated network across all time points, our method uncovers the underlying dynamical mechanism across time points. We validate our algorithm on 13 simulated datasets with both synthetic and curated networks and demonstrate its efficacy on 4 experimental data sets.
{"title":"Optimal transport reveals dynamic gene regulatory networks via gene velocity estimation","authors":"Wenjun Zhao, Erica Larschan, Bjorn Sandstede, Ritambhara Singh","doi":"10.1101/2024.09.12.612590","DOIUrl":"https://doi.org/10.1101/2024.09.12.612590","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-16"}
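The time-lagged correlation step named in the OTVelo abstract above can be sketched as follows. This is a simplified illustration only: it correlates hypothetical per-time-point expression averages directly, whereas the abstract describes working with single cells and with gene velocities estimated by optimal transport. The function names and data are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(regulator, target, lag=1):
    """Correlate the regulator at time t with the target at time t + lag."""
    if lag < 1 or lag >= len(regulator):
        raise ValueError("lag must be between 1 and len(regulator) - 1")
    return pearson(regulator[:-lag], target[lag:])


# Hypothetical averages over five time points; the target follows the
# regulator with a delay of one time point (target[t + 1] = 2 * regulator[t]).
regulator = [1.0, 3.0, 2.0, 5.0, 4.0]
target = [0.0, 2.0, 6.0, 4.0, 10.0]

same_time = pearson(regulator, target)
one_step = lagged_correlation(regulator, target, lag=1)
print(same_time, one_step)
```

In this toy series the lagged correlation is much higher than the same-time correlation, which is the kind of asymmetry that points to a regulator acting on a target across consecutive time points.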
Pub Date: 2024-09-16; DOI: 10.1101/2024.09.13.612853
Gerard A Bouland, Niccolo Tesi, Ahmed Mahfouz, Marcel Reinders
To investigate the functional significance of genetic risk loci identified through genome-wide association studies (GWASs), genetic loci are linked to genes based on their capacity to account for variation in gene expression, resulting in expression quantitative trait loci (eQTL). Following this, gene set analyses are commonly used to gain insights into functionality. However, the efficacy of this approach is hampered by small effect sizes and the burden of multiple testing. We propose an alternative approach: instead of examining the cumulative associations of individual genes within a gene set, we consider the collective variation of the entire gene set. We introduce the concept of gene set QTL (gsQTL) and show that it is better at identifying links between genetic risk variants and specific gene sets. Notably, gsQTL is less susceptible to inflation or deflation of significant enrichments than conventional methods. Furthermore, we demonstrate the broader applicability of shared variability within gene sets, for example in the coordinated regulation of genes by a transcription factor or in coordinated differential expression.
{"title":"gsQTL: Associating genetic risk variants with gene sets by exploiting their shared variability","authors":"Gerard A Bouland, Niccolo Tesi, Ahmed Mahfouz, Marcel Reinders","doi":"10.1101/2024.09.13.612853","DOIUrl":"https://doi.org/10.1101/2024.09.13.612853","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-16"}
Pub Date: 2024-09-16; DOI: 10.1101/2024.09.12.612581
Yu-Hao Zeng, Zhen-Ning Yin, Hao Luo, Feng Gao
DNA replication is a complex and crucial biological process in eukaryotes. To facilitate the study of eukaryotic replication events, we present the database of eukaryotic DNA replication origins (DeOri), which collects scattered data and integrates extensive sequencing data on eukaryotic DNA replication origins. With continuous updates of DeOri, the number of datasets in the new release has increased from 10 to 151 and the number of sequences from 16,145 to 9,742,396. Besides nucleotide sequences and BED files, corresponding annotation files covering coding sequences (CDS), mRNA, and other biological elements within replication origins are also provided. The experimental techniques used for each dataset, as well as other statistical data, are presented on the web page. Differences in experimental methods, cell lines, and sequencing technologies have resulted in distinct sets of replication origins, making it challenging to differentiate between cell-specific and non-specific replication. We combined multiple sets of replication origins at the species level, scored them, and screened them. The screened regions were considered species-conserved origins. They are integrated and presented as reference replication origins (rORIs) for Homo sapiens, Gallus gallus, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans. Additionally, we analyzed the genome-level distribution of genomic elements associated with replication origins, such as CpG islands (CGI), transcription start sites (TSS), and G-quadruplexes (G4). These analysis results allow users to select the data they need. DeOri is available at http://tubic.tju.edu.cn/deori10/.
{"title":"DeOri 10.0: An Updated Database of Experimentally Identified Eukaryotic Replication Origins","authors":"Yu-Hao Zeng, Zhen-Ning Yin, Hao Luo, Feng Gao","doi":"10.1101/2024.09.12.612581","DOIUrl":"https://doi.org/10.1101/2024.09.12.612581","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-16"}
Pub Date: 2024-09-16; DOI: 10.1101/2024.09.11.612538
Justin McKetney, Ian J Miller, Alexandre Hutton, Pavel Sinitcyn, Joshua J Coon, Jesse G Meyer
Peptide ion mobility adds an extra dimension of separation to mass spectrometry-based proteomics. The ability to accurately predict peptide ion mobility would be useful to expedite assay development and to discriminate true answers in database search. There are methods to accurately predict peptide ion mobility through drift tube devices, but methods to predict mobility through high-field asymmetric waveform ion mobility spectrometry (FAIMS) are underexplored. Here, we model peptide ions' FAIMS mobility using a multi-label, multi-output classification scheme to account for non-normal transmission distributions. We trained two models on over 100,000 human peptide precursors: a random forest and a long short-term memory (LSTM) neural network. The two models had different strengths, and the ensemble average of their predictions produced a higher F2 score than either model alone. Finally, we explore cases where the models make mistakes and demonstrate predictive performance of F2=0.66 (AUROC=0.928) on a new test dataset of nearly 40,000 different E. coli peptide ions. The deep learning model is easily accessible via https://faims.xods.org.
{"title":"Deep Learning Predicts Non-Normal Peptide FAIMS Mobility Distributions Directly from Sequence","authors":"Justin McKetney, Ian J Miller, Alexandre Hutton, Pavel Sinitcyn, Joshua J Coon, Jesse G Meyer","doi":"10.1101/2024.09.11.612538","DOIUrl":"https://doi.org/10.1101/2024.09.11.612538","journal":{"name":"bioRxiv - Bioinformatics"},"publicationDate":"2024-09-16"}
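The F2 score reported in the FAIMS abstract above is the F-beta score with beta = 2, which weights recall more heavily than precision. A minimal sketch of the standard formula follows; the labels are hypothetical, and how the authors aggregate the score across multiple output labels is not stated in the abstract, so this covers a single binary label only.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical labels for one output class (for example, one FAIMS setting).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
f2 = f_beta(precision, recall, beta=2.0)
print(precision, recall, f2)
```

Because beta = 2 emphasizes recall, a classifier that misses true positives is penalized more than one that raises some false alarms. For instance, precision 0.5 with recall 1.0 gives a higher F2 than precision 1.0 with recall 0.5.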