arXiv - QuanBio - Genomics: latest papers, page 9

F5C-finder: An Explainable and Ensemble Biological Language Model for Predicting 5-Formylcytidine Modifications on mRNA

arXiv - QuanBio - Genomics

Pub Date : 2024-04-20 DOI: arxiv-2404.13265

Guohao Wang, Ting Liu, Hongqiang Lyu, Ze Liu

As a prevalent and dynamically regulated epigenetic modification, 5-formylcytidine (f5C) is crucial in various biological processes. However, traditional experimental methods for f5C detection are often laborious and time-consuming, limiting their ability to map f5C sites across the transcriptome comprehensively. While computational approaches offer a cost-effective and high-throughput alternative, no recognition model for f5C has been developed to date. Drawing inspiration from language models in natural language processing, this study presents f5C-finder, an ensemble neural network-based model utilizing multi-head attention for the identification of f5C. Five distinct feature extraction methods were employed to construct five individual artificial neural networks, and these networks were subsequently integrated through ensemble learning to create f5C-finder. 10-fold cross-validation and independent tests demonstrate that f5C-finder achieves state-of-the-art (SOTA) performance with AUC of 0.807 and 0.827, respectively. The result highlights the effectiveness of biological language model in capturing both the order (sequential) and functional meaning (semantics) within genomes. Furthermore, the built-in interpretability allows us to understand what the model is learning, creating a bridge between identifying key sequential elements and a deeper exploration of their biological functions.
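The abstract does not say how the five networks are combined, so the following is only a minimal sketch of one common choice, soft voting (averaging each sub-model's predicted probability per site), together with a rank-based AUC of the kind reported above. The probabilities, labels, and function names are hypothetical illustrations, not values or code from f5C-finder.

```python
from statistics import mean


def soft_vote(per_model_probs):
    """Average per-site probabilities across models (soft voting)."""
    return [mean(site_probs) for site_probs in zip(*per_model_probs)]


def auc(labels, scores):
    """AUC as the chance that a positive site outranks a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outputs of five sub-models on four candidate sites.
per_model_probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.3, 0.7, 0.5],
    [0.7, 0.1, 0.4, 0.6],
    [0.9, 0.4, 0.5, 0.3],
    [0.6, 0.2, 0.8, 0.2],
]
labels = [1, 0, 1, 0]  # hypothetical ground truth: 1 = modified site

ensemble = soft_vote(per_model_probs)
print(ensemble, auc(labels, ensemble))
```

Note that the third and fourth sub-models each misrank one pair of sites on their own, while the averaged scores separate all positives from negatives in this toy case; that is the usual motivation for ensembling.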

Citations: 0

Wasserstein Wormhole: Scalable Optimal Transport Distance with Transformers

arXiv - QuanBio - Genomics

Pub Date : 2024-04-15 DOI: arxiv-2404.09411

Doron Haviv, Russell Zhang Kunes, Thomas Dougherty, Cassandra Burdziak, Tal Nawy, Anna Gilbert, Dana Pe'er

Optimal transport (OT) and the related Wasserstein metric (W) are powerful and ubiquitous tools for comparing distributions. However, computing pairwise Wasserstein distances rapidly becomes intractable as cohort size grows. An attractive alternative would be to find an embedding space in which pairwise Euclidean distances map to OT distances, akin to standard multidimensional scaling (MDS). We present Wasserstein Wormhole, a transformer-based autoencoder that embeds empirical distributions into a latent space wherein Euclidean distances approximate OT distances. Extending MDS theory, we show that our objective function implies a bound on the error incurred when embedding non-Euclidean distances. Empirically, distances between Wormhole embeddings closely match Wasserstein distances, enabling linear time computation of OT distances. Along with an encoder that maps distributions to embeddings, Wasserstein Wormhole includes a decoder that maps embeddings back to distributions, allowing for operations in the embedding space to generalize to OT spaces, such as Wasserstein barycenter estimation and OT interpolation. By lending scalability and interpretability to OT approaches, Wasserstein Wormhole unlocks new avenues for data analysis in the fields of computational geometry and single-cell biology.
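To make the embedding idea concrete, here is a small sketch of an MDS-style objective: the mean squared gap between Euclidean distances of embeddings and Wasserstein distances of the underlying samples. It is not the Wormhole model; the transformer encoder is replaced by a toy stand-in (the sample mean), the data are 1D so that the Wasserstein distance has a closed form via sorting, and the names `wasserstein_1d` and `stress` are hypothetical.

```python
import math


def wasserstein_1d(xs, ys, p=2):
    """Exact p-Wasserstein distance between two equal-size 1D samples.

    In one dimension the optimal coupling matches sorted values in order.
    """
    assert len(xs) == len(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (cost / len(xs)) ** (1 / p)


def stress(embeddings, samples):
    """Mean squared gap between embedding distances and OT distances."""
    gaps = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            d_embed = math.dist(embeddings[i], embeddings[j])
            d_ot = wasserstein_1d(samples[i], samples[j])
            gaps.append((d_embed - d_ot) ** 2)
    return sum(gaps) / len(gaps)


# Three toy empirical distributions that are pure shifts of one another.
samples = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Toy "encoder" (an assumption for illustration): embed each sample by its mean.
embeddings = [[sum(s) / len(s)] for s in samples]

print(stress(embeddings, samples))
```

For shifted copies of one distribution the mean is an exact embedding, so the stress is zero here; a learned encoder is needed precisely because real distributions differ in more than location.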

Citations: 0

A Systematic Overview of Single-Cell Transcriptomics Databases, their Use cases, and Limitations

arXiv - QuanBio - Genomics

Pub Date : 2024-04-15 DOI: arxiv-2404.10545

Mahnoor N. Gondal, Saad Ur Rehman Shah, Arul M. Chinnaiyan, Marcin Cieslik

Rapid advancements in high-throughput single-cell RNA-seq (scRNA-seq) technologies and experimental protocols have led to the generation of vast amounts of genomic data that populates several online databases and repositories. Here, we systematically examined large-scale scRNA-seq databases, categorizing them based on their scope and purpose such as general, tissue-specific databases, disease-specific databases, cancer-focused databases, and cell type-focused databases. Next, we discuss the technical and methodological challenges associated with curating large-scale scRNA-seq databases, along with current computational solutions. We argue that understanding scRNA-seq databases, including their limitations and assumptions, is crucial for effectively utilizing this data to make robust discoveries and identify novel biological insights. Furthermore, we propose that bridging the gap between computational and wet lab scientists through user-friendly web-based platforms is needed for democratizing access to single-cell data. These platforms would facilitate interdisciplinary research, enabling researchers from various disciplines to collaborate effectively. This review underscores the importance of leveraging computational approaches to unravel the complexities of single-cell data and offers a promising direction for future research in the field.

Citations: 0

scRDiT: Generating single-cell RNA-seq data by diffusion transformers and accelerating sampling

arXiv - QuanBio - Genomics

Pub Date : 2024-04-09 DOI: arxiv-2404.06153

Shengze Dong, Zhuorui Cui, Ding Liu, Jinzhi Lei

Motivation: Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology extensively utilized in biological research, facilitating the examination of gene expression at the individual cell level within a given tissue sample. While numerous tools have been developed for scRNA-seq data analysis, the challenge persists in capturing the distinct features of such data and replicating virtual datasets that share analogous statistical properties. Results: Our study introduces a generative approach termed scRNA-seq Diffusion Transformer (scRDiT). This method generates virtual scRNA-seq data by leveraging a real dataset. The method is a neural network constructed based on Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs). This involves subjecting Gaussian noises to the real dataset through iterative noise-adding steps and ultimately restoring the noises to form scRNA-seq samples. This scheme allows us to learn data features from actual scRNA-seq samples during model training. Our experiments, conducted on two distinct scRNA-seq datasets, demonstrate superior performance. Additionally, the model sampling process is expedited by incorporating Denoising Diffusion Implicit Models (DDIM). scRDiT presents a unified methodology empowering users to train neural network models with their unique scRNA-seq datasets, enabling the generation of numerous high-quality scRNA-seq samples. Availability and implementation: https://github.com/DongShengze/scRDiT
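The noise-adding half of a DDPM can be sketched in a few lines. The example below is a generic illustration of the standard closed-form forward step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and is not taken from the scRDiT repository; the linear beta schedule, the number of steps, and the toy expression vector are all assumptions.

```python
import math
import random


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, alpha_bar, rng):
    """Sample x_t from N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I) in one shot."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar) * value + math.sqrt(1.0 - alpha_bar) * eps
        for value, eps in zip(x0, noise)
    ]
    return xt, noise


rng = random.Random(0)
schedule = alpha_bars(1000)

# Hypothetical normalized expression values for one cell (four genes).
cell = [0.0, 2.3, 0.0, 1.1]
noisy_cell, target_noise = add_noise(cell, schedule[499], rng)
print(noisy_cell)
```

During training a denoising network would be asked to predict `target_noise` from `noisy_cell` and the step index; DDIM-style acceleration then samples along a shorter subsequence of steps instead of all of them.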

Citations: 0

scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph Embedding

arXiv - QuanBio - Genomics

Pub Date : 2024-04-09 DOI: arxiv-2404.06167

Ping Xu, Zhiyuan Ning, Meng Xiao, Guihai Feng, Xin Li, Yuanchun Zhou, Pengfei Wang

Single-cell RNA sequencing (scRNA-seq) is essential for unraveling cellular heterogeneity and diversity, offering invaluable insights for bioinformatics advancements. Despite its potential, traditional clustering methods in scRNA-seq data analysis often neglect the structural information embedded in gene expression profiles, crucial for understanding cellular correlations and dependencies. Existing strategies, including graph neural networks, face challenges in handling the inefficiency due to scRNA-seq data's intrinsic high-dimension and high-sparsity. Addressing these limitations, we introduce scCDCG (single-cell RNA-seq Clustering via Deep Cut-informed Graph), a novel framework designed for efficient and accurate clustering of scRNA-seq data that simultaneously utilizes intercellular high-order structural information. scCDCG comprises three main components: (i) A graph embedding module utilizing deep cut-informed techniques, which effectively captures intercellular high-order structural information, overcoming the over-smoothing and inefficiency issues prevalent in prior graph neural network methods. (ii) A self-supervised learning module guided by optimal transport, tailored to accommodate the unique complexities of scRNA-seq data, specifically its high-dimension and high-sparsity. (iii) An autoencoder-based feature learning module that simplifies model complexity through effective dimension reduction and feature extraction. Our extensive experiments on 6 datasets demonstrate scCDCG's superior performance and efficiency compared to 7 established models, underscoring scCDCG's potential as a transformative tool in scRNA-seq data analysis. Our code is available at: https://github.com/XPgogogo/scCDCG.

Citations: 0

Guide to k-mer approaches for genomics across the tree of life

arXiv - QuanBio - Genomics

Pub Date : 2024-04-01 DOI: arxiv-2404.01519

Katharine M. Jenike, Lucía Campos-Domínguez, Marilou Boddé, José Cerca, Christina N. Hodson, Michael C. Schatz, Kamil S. Jaron

The wide array of currently available genomes display a wonderful diversity in size, composition and structure with many more to come thanks to several global biodiversity genomics initiatives starting in recent years. However, sequencing of genomes, even with all the recent advances, can still be challenging for both technical (e.g. small physical size, contaminated samples, or access to appropriate sequencing platforms) and biological reasons (e.g. germline restricted DNA, variable ploidy levels, sex chromosomes, or very large genomes). In recent years, k-mer-based techniques have become popular to overcome some of these challenges. They are based on the simple process of dividing the analysed sequences (e.g. raw reads or genomes) into a set of sub-sequences of length k, called k-mers. Despite this apparent simplicity, k-mer-based analysis allows for a rapid and intuitive assessment of complex sequencing datasets. Here, we provide the first comprehensive review to the theoretical properties and practical applications of k-mers in biodiversity genomics, serving as a reference manual for this powerful approach.
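The decomposition described above, splitting each sequence into all overlapping sub-sequences of length k, can be sketched as follows. The reads and function names are made up for illustration, and the sketch counts k-mers exactly as they appear in the reads; dedicated k-mer counters often also merge each k-mer with its reverse complement, which is omitted here.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every overlapping sub-sequence of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_spectrum(reads, k):
    """Count how often each k-mer occurs across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Two hypothetical overlapping reads.
reads = ["GATTACA", "TTACAGA"]
spectrum = kmer_spectrum(reads, 3)
print(spectrum)
```

A read of length L yields L - k + 1 k-mers, so the two 7-base reads above contribute five 3-mers each; k-mers shared by both reads (here TTA, TAC and ACA) reach a count of 2, which is the kind of signal that k-mer histograms summarize at genome scale.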

Citations: 0

Just-DNA-Seq, open-source personal genomics platform: longevity science for everyone

arXiv - QuanBio - Genomics

Pub Date : 2024-03-28 DOI: arxiv-2403.19087

Kulaga Anton (Institute for Biostatistics and Informatics in Medicine and Ageing Research; Institute of Biochemistry of the Romanian Academy; International Longevity Alliance), Borysova Olga (International Longevity Alliance; CellFabrik SRL), Karmazin Alexey (International Longevity Alliance; MitoSpace), Koval Maria (Institute of Biochemistry of the Romanian Academy; International Longevity Alliance), Usanov Nikolay (Institute of Biochemistry of the Romanian Academy; International Longevity Alliance), Fedorova Alina (Institute of Biochemistry of the Romanian Academy), Evfratov Sergey (Institute of Biochemistry of the Romanian Academy), Pushkareva Malvina (Institute of Biochemistry of the Romanian Academy), Ryangguk Kim (Oak Bioinformatics LLC), Tacutu Robi (SecvADN SRL)

Genomic data has become increasingly accessible to the general public with the advent of companies offering whole genome sequencing at a relatively low cost. However, their reports are not verifiable due to a lack of crucial details and transparency: polygenic risk scores do not always mention all the polymorphisms involved. Simultaneously, tackling the manual investigation and interpretation of data proves challenging for individuals lacking a background in genetics. Currently, there is no open-source or commercial solution that provides comprehensive longevity reports surpassing a limited number of polymorphisms. Additionally, there are no ready-made, out-of-the-box solutions available that require minimal expertise to generate reports independently. To address these issues, we have developed the Just-DNA-Seq open-source genomic platform. Just-DNA-Seq aims to provide a user-friendly solution to genome annotation by allowing users to upload their own VCF files and receive annotations of their genetic variants and polygenic risk scores related to longevity. We also created GeneticsGenie custom GPT that can answer genetics questions based on our modules. With the Just-DNA-Seq platform, we want to provide full information regarding the genetics of long life: disease-predisposing variants, that can reduce lifespan and manifest at different age (cardiovascular, oncological, neurodegenerative diseases, etc.), pro-longevity variants and longevity drug pharmacokinetics. In this research article, we will discuss the features and capabilities of Just-DNA-Seq, and how it can benefit individuals looking to understand and improve their health. It's crucial to note that the Just-DNA-Seq platform is exclusively intended for scientific and informational purposes and is not suitable for medical applications.
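As a rough illustration of what a transparent polygenic risk score could look like, the sketch below computes the usual weighted sum of risk-allele dosages and also reports which polymorphisms were scored and which were missing from the genotype. It is not Just-DNA-Seq code: the function name, the variant identifiers, and the weights are hypothetical placeholders, and the dosages are assumed to have been extracted from a VCF file beforehand.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages, plus the variants used and missed.

    dosages: variant id -> number of risk alleles carried (0, 1 or 2)
    weights: variant id -> per-allele effect size from a published score
    """
    scored = sorted(weights.keys() & dosages.keys())
    missing = sorted(weights.keys() - dosages.keys())
    score = sum(weights[variant] * dosages[variant] for variant in scored)
    return score, scored, missing


# Hypothetical effect sizes and genotype; the ids are placeholders, not real findings.
weights = {"rs0000001": 0.5, "rs0000002": -0.25, "rs0000003": 0.125}
dosages = {"rs0000001": 2, "rs0000002": 1}

score, scored, missing = polygenic_risk_score(dosages, weights)
print(score, scored, missing)
```

Returning the `scored` and `missing` lists alongside the number is the design point: a report that lists every polymorphism behind a score can be checked by the reader, which is the verifiability gap the abstract describes.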

Citations: 0

Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA

arXiv - QuanBio - Genomics

Pub Date : 2024-03-28 DOI: arxiv-2403.19416

Tomáš Brůna, Lars Gabriel, Katharina J. Hoff

Annotating the structure of protein-coding genes represents a major challenge in the analysis of eukaryotic genomes. This task sets the groundwork for subsequent genomic studies aimed at understanding the functions of individual genes. BRAKER and Galba are two fully automated and containerized pipelines designed to perform accurate genome annotation. BRAKER integrates the GeneMark-ETP and AUGUSTUS gene finders, employing the TSEBRA combiner to attain high sensitivity and precision. BRAKER is adept at handling genomes of any size, provided that it has access to both transcript expression sequencing data and an extensive protein database from the target clade. In particular, BRAKER demonstrates high accuracy even with only one type of these extrinsic evidence sources, although it should be noted that accuracy diminishes for larger genomes under such conditions. In contrast, Galba adopts a distinct methodology utilizing the outcomes of direct protein-to-genome spliced alignments using miniprot to generate training genes and evidence for gene prediction in AUGUSTUS. Galba has superior accuracy in large genomes if protein sequences are the only source of evidence. This chapter provides practical guidelines for employing both pipelines in the annotation of eukaryotic genomes, with a focus on insect genomes.

Citations: 0

Genetic diversity of barley accessions and their response under abiotic stresses using different approaches

arXiv - QuanBio - Genomics

Pub Date : 2024-03-21 DOI: arxiv-2403.14181

Djshwar Dhahir Lateef, Nawroz Abdul-razzak Tahir

In this investigation, five separate experiments were carried out. The first experiment examined the molecular characteristics of 59 barley accessions collected from different regions in Iraq using three different molecular markers (ISSR, CDDP, and Scot). A total of 391 polymorphic bands were amplified using forty-four ISSR, nine CDDP, and twelve Scot primers, which yielded 255, 35, and 101 polymorphic bands, respectively. The mean PIC values for the ISSR, CDDP, and Scot markers were 0.74, 0.63, and 0.80, respectively, indicating the efficiency of these markers in detecting polymorphism among the studied barley accessions. Based on the respective markers, the barley accessions were classified and clustered into two main groups using UPGMA and population structure analysis. The cluster analyses showed that the variation patterns corresponded with the geographical distribution of the barley accessions.
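
The abstract reports mean PIC (polymorphism information content) values but does not say which formula was used. As a hedged illustration, the sketch below computes PIC with the classical multi-allele definition, PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j², where p_i are allele frequencies at one locus. Choosing this definition is an assumption of the sketch and not a statement about the authors' method; the function name is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def polymorphism_information_content(freqs: Sequence[float]) -> float:
    """PIC for one locus from allele frequencies that sum to 1.

    Uses PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2.
    This definition is an assumption; other PIC variants exist.
    """
    if any(p < 0 for p in freqs):
        raise ValueError("allele frequencies must be non-negative")
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Two equally frequent alleles give the textbook value 0.375.
two_allele_pic = polymorphism_information_content([0.5, 0.5])
```

A monomorphic locus (a single allele at frequency 1) gives a PIC of 0, and the value rises as more alleles occur at similar frequencies.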

Citations: 0

Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction

arXiv - QuanBio - Genomics

Pub Date : 2024-03-18 DOI: arxiv-2403.11375

Hongxiao Wang, Yang Yang, Zhuo Zhao, Pengfei Gu, Nishchal Sapkota, Danny Z. Chen

For predicting cancer survival outcomes, standard approaches in clinical research are often based on two main modalities: pathology images for observing cell morphology features, and genomic data (e.g., bulk RNA-seq) for quantifying gene expression. However, existing pathology-genomic multi-modal algorithms face significant challenges: (1) valuable biological insights regarding genes and gene-gene interactions are frequently overlooked; (2) one modality often dominates the optimization process, causing inadequate training for the other modality. In this paper, we introduce a new multi-modal "Path-GPTOmic" framework for cancer survival outcome prediction. First, to extract valuable biological insights, we regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data, making it adaptable for bulk RNA-seq data. Second, to address the imbalance-between-modalities problem, we propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction. The contributions of the modalities are dynamically monitored and adjusted during the training process, encouraging sufficient training of both modalities. Evaluated on two TCGA (The Cancer Genome Atlas) datasets, our model achieves substantially improved survival prediction accuracy.
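
For readers unfamiliar with the Cox partial likelihood loss named in the abstract, a minimal plain-Python sketch follows. It computes the negative log partial likelihood for predicted risk scores, with tied event times handled in the Breslow style (every sample with time greater than or equal to the event time is in the risk set). It shows only the standard loss and not the paper's gradient modulation mechanism; the function and argument names are hypothetical, and returning the unnormalized sum is a choice of this sketch.

```python
import math
from typing import Sequence


def cox_neg_log_partial_likelihood(
    risk: Sequence[float],
    times: Sequence[float],
    events: Sequence[int],
) -> float:
    """Negative Cox log partial likelihood (sum over observed events).

    risk:   predicted log-risk score per sample (higher means earlier event).
    times:  observed time per sample (event or censoring time).
    events: 1 if the event was observed, 0 if the sample was censored.
    """
    if not (len(risk) == len(times) == len(events)):
        raise ValueError("inputs must have the same length")
    loss = 0.0
    for i in range(len(risk)):
        if not events[i]:
            continue  # censored samples only appear in other samples' risk sets
        at_risk = [risk[j] for j in range(len(risk)) if times[j] >= times[i]]
        # Numerically stable log-sum-exp over the risk set.
        m = max(at_risk)
        log_sum = m + math.log(sum(math.exp(r - m) for r in at_risk))
        loss -= risk[i] - log_sum
    return loss


# With equal risk scores the loss reduces to log(3) + log(2) = log(6).
baseline_loss = cox_neg_log_partial_likelihood([0.0, 0.0, 0.0], [1, 2, 3], [1, 1, 0])
```

Assigning a higher risk score to the sample with the earliest event lowers the loss, which is the ordering behavior a survival model is trained toward.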

Citations: 0