As a prevalent and dynamically regulated epigenetic modification, 5-formylcytidine (f5C) is crucial in various biological processes. However, traditional experimental methods for f5C detection are often laborious and time-consuming, limiting their ability to map f5C sites comprehensively across the transcriptome. While computational approaches offer a cost-effective and high-throughput alternative, no recognition model for f5C has been developed to date. Drawing inspiration from language models in natural language processing, this study presents f5C-finder, an ensemble neural network-based model that uses multi-head attention to identify f5C. Five distinct feature extraction methods were employed to construct five individual artificial neural networks, which were then integrated through ensemble learning to create f5C-finder. 10-fold cross-validation and independent tests demonstrate that f5C-finder achieves state-of-the-art (SOTA) performance, with AUCs of 0.807 and 0.827, respectively. These results highlight the effectiveness of biological language models in capturing both the order (sequence) and functional meaning (semantics) within genomes. Furthermore, the built-in interpretability allows us to understand what the model is learning, creating a bridge between identifying key sequential elements and a deeper exploration of their biological functions.
{"title":"F5C-finder: An Explainable and Ensemble Biological Language Model for Predicting 5-Formylcytidine Modifications on mRNA","authors":"Guohao Wang, Ting Liu, Hongqiang Lyu, Ze Liu","doi":"arxiv-2404.13265","DOIUrl":"https://doi.org/arxiv-2404.13265","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-04-20"}
Doron Haviv, Russell Zhang Kunes, Thomas Dougherty, Cassandra Burdziak, Tal Nawy, Anna Gilbert, Dana Pe'er
Optimal transport (OT) and the related Wasserstein metric (W) are powerful and ubiquitous tools for comparing distributions. However, computing pairwise Wasserstein distances rapidly becomes intractable as cohort size grows. An attractive alternative would be to find an embedding space in which pairwise Euclidean distances map to OT distances, akin to standard multidimensional scaling (MDS). We present Wasserstein Wormhole, a transformer-based autoencoder that embeds empirical distributions into a latent space wherein Euclidean distances approximate OT distances. Extending MDS theory, we show that our objective function implies a bound on the error incurred when embedding non-Euclidean distances. Empirically, distances between Wormhole embeddings closely match Wasserstein distances, enabling linear time computation of OT distances. Along with an encoder that maps distributions to embeddings, Wasserstein Wormhole includes a decoder that maps embeddings back to distributions, allowing for operations in the embedding space to generalize to OT spaces, such as Wasserstein barycenter estimation and OT interpolation. By lending scalability and interpretability to OT approaches, Wasserstein Wormhole unlocks new avenues for data analysis in the fields of computational geometry and single-cell biology.
{"title":"Wasserstein Wormhole: Scalable Optimal Transport Distance with Transformers","authors":"Doron Haviv, Russell Zhang Kunes, Thomas Dougherty, Cassandra Burdziak, Tal Nawy, Anna Gilbert, Dana Pe'er","doi":"arxiv-2404.09411","DOIUrl":"https://doi.org/arxiv-2404.09411","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-04-15"}
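The idea of an embedding in which Euclidean distances stand in for OT distances can be illustrated with a case where it holds exactly. This is a minimal sketch and not the Wasserstein Wormhole model: it assumes one-dimensional samples of equal size, where the 2-Wasserstein distance reduces to comparing sorted values, so a sorted and rescaled sample acts as an exact embedding. The function names are hypothetical.

```python
import math

def w2_1d(xs, ys):
    """Exact 2-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut assumes equal sample sizes")
    n = len(xs)
    # In 1-D the optimal coupling matches the i-th smallest x to the i-th smallest y.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / n)

def quantile_embedding(xs):
    """Embed a sample as its sorted values scaled by 1/sqrt(n) (illustrative only)."""
    n = len(xs)
    return [v / math.sqrt(n) for v in sorted(xs)]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = [2.0, 0.0, 1.0]
y = [3.0, 1.0, 2.0]
ot_distance = w2_1d(x, y)
embedded_distance = euclidean(quantile_embedding(x), quantile_embedding(y))
```

Once each sample is embedded, any pairwise distance is a cheap vector operation. In higher dimensions no such closed-form embedding exists, which is the gap a learned encoder is meant to fill.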
Mahnoor N. Gondal, Saad Ur Rehman Shah, Arul M. Chinnaiyan, Marcin Cieslik
Rapid advancements in high-throughput single-cell RNA-seq (scRNA-seq) technologies and experimental protocols have led to the generation of vast amounts of genomic data that populate several online databases and repositories. Here, we systematically examine large-scale scRNA-seq databases, categorizing them by scope and purpose into general, tissue-specific, disease-specific, cancer-focused, and cell type-focused databases. Next, we discuss the technical and methodological challenges associated with curating large-scale scRNA-seq databases, along with current computational solutions. We argue that understanding scRNA-seq databases, including their limitations and assumptions, is crucial for effectively utilizing these data to make robust discoveries and identify novel biological insights. Furthermore, we propose that bridging the gap between computational and wet lab scientists through user-friendly web-based platforms is needed to democratize access to single-cell data. These platforms would facilitate interdisciplinary research, enabling researchers from various disciplines to collaborate effectively. This review underscores the importance of leveraging computational approaches to unravel the complexities of single-cell data and offers a promising direction for future research in the field.
{"title":"A Systematic Overview of Single-Cell Transcriptomics Databases, their Use cases, and Limitations","authors":"Mahnoor N. Gondal, Saad Ur Rehman Shah, Arul M. Chinnaiyan, Marcin Cieslik","doi":"arxiv-2404.10545","DOIUrl":"https://doi.org/arxiv-2404.10545","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-04-15"}
Motivation: Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology extensively utilized in biological research, facilitating the examination of gene expression at the individual cell level within a given tissue sample. While numerous tools have been developed for scRNA-seq data analysis, it remains challenging to capture the distinct features of such data and to generate virtual datasets that share analogous statistical properties. Results: Our study introduces a generative approach termed scRNA-seq Diffusion Transformer (scRDiT). This method generates virtual scRNA-seq data by leveraging a real dataset. The method is a neural network built on Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs). It adds Gaussian noise to the real dataset through iterative noising steps and ultimately restores the noise to form scRNA-seq samples. This scheme allows us to learn data features from actual scRNA-seq samples during model training. Our experiments, conducted on two distinct scRNA-seq datasets, demonstrate superior performance. Additionally, the model sampling process is accelerated by incorporating Denoising Diffusion Implicit Models (DDIM). scRDiT presents a unified methodology that lets users train neural network models on their own scRNA-seq datasets and generate numerous high-quality scRNA-seq samples. Availability and implementation: https://github.com/DongShengze/scRDiT
{"title":"scRDiT: Generating single-cell RNA-seq data by diffusion transformers and accelerating sampling","authors":"Shengze Dong, Zhuorui Cui, Ding Liu, Jinzhi Lei","doi":"arxiv-2404.06153","DOIUrl":"https://doi.org/arxiv-2404.06153","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-04-09"}
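The iterative noise-adding step mentioned in the scRDiT abstract can be sketched with the standard DDPM closed form, q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I), where abar_t is the running product of (1 - beta_s). This is a minimal sketch under assumptions: the noise schedule, the toy expression vector, and the function names are hypothetical, and the abstract does not state which schedule scRDiT actually uses.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): the signal fraction kept after each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def forward_noise(x0, t, betas, eps=None, rng=random):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    abar = alpha_bars(betas)[t]
    if eps is None:
        # Draw standard Gaussian noise, one value per gene (assumed input layout).
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
    return [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]

# Hypothetical linear schedule and a toy normalized expression vector.
betas = [0.0001 + i * (0.02 - 0.0001) / 999 for i in range(1000)]
cell = [0.0, 1.5, 0.3, 0.0]
noisy_cell = forward_noise(cell, t=999, betas=betas)
```

At the last step abar_t is close to zero, so the sample is almost pure noise. A trained denoiser would run the reverse direction, and DDIM-style sampling shortens that reverse pass by skipping steps.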
Single-cell RNA sequencing (scRNA-seq) is essential for unraveling cellular heterogeneity and diversity, offering invaluable insights for bioinformatics advancements. Despite its potential, traditional clustering methods in scRNA-seq data analysis often neglect the structural information embedded in gene expression profiles, which is crucial for understanding cellular correlations and dependencies. Existing strategies, including graph neural networks, suffer from inefficiency caused by the intrinsically high dimensionality and high sparsity of scRNA-seq data. Addressing these limitations, we introduce scCDCG (single-cell RNA-seq Clustering via Deep Cut-informed Graph), a novel framework designed for efficient and accurate clustering of scRNA-seq data that simultaneously utilizes intercellular high-order structural information. scCDCG comprises three main components: (i) A graph embedding module utilizing deep cut-informed techniques, which effectively captures intercellular high-order structural information, overcoming the over-smoothing and inefficiency issues prevalent in prior graph neural network methods. (ii) A self-supervised learning module guided by optimal transport, tailored to accommodate the unique complexities of scRNA-seq data, specifically its high dimensionality and high sparsity. (iii) An autoencoder-based feature learning module that simplifies model complexity through effective dimension reduction and feature extraction. Our extensive experiments on 6 datasets demonstrate scCDCG's superior performance and efficiency compared to 7 established models, underscoring scCDCG's potential as a transformative tool in scRNA-seq data analysis. Our code is available at: https://github.com/XPgogogo/scCDCG.
{"title":"scCDCG: Efficient Deep Structural Clustering for single-cell RNA-seq via Deep Cut-informed Graph Embedding","authors":"Ping Xu, Zhiyuan Ning, Meng Xiao, Guihai Feng, Xin Li, Yuanchun Zhou, Pengfei Wang","doi":"arxiv-2404.06167","DOIUrl":"https://doi.org/arxiv-2404.06167","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-04-09"}
Katharine M. Jenike, Lucía Campos-Domínguez, Marilou Boddé, José Cerca, Christina N. Hodson, Michael C. Schatz, Kamil S. Jaron
The wide array of currently available genomes displays a wonderful diversity in size, composition and structure, with many more to come thanks to several global biodiversity genomics initiatives started in recent years. However, sequencing of genomes, even with all the recent advances, can still be challenging for both technical reasons (e.g. small physical size, contaminated samples, or access to appropriate sequencing platforms) and biological reasons (e.g. germline-restricted DNA, variable ploidy levels, sex chromosomes, or very large genomes). In recent years, k-mer-based techniques have become popular for overcoming some of these challenges. They are based on the simple process of dividing the analysed sequences (e.g. raw reads or genomes) into a set of sub-sequences of length k, called k-mers. Despite this apparent simplicity, k-mer-based analysis allows for a rapid and intuitive assessment of complex sequencing datasets. Here, we provide the first comprehensive review of the theoretical properties and practical applications of k-mers in biodiversity genomics, serving as a reference manual for this powerful approach.
{"title":"Guide to k-mer approaches for genomics across the tree of life","authors":"Katharine M. Jenike, Lucía Campos-Domínguez, Marilou Boddé, José Cerca, Christina N. Hodson, Michael C. Schatz, Kamil S. Jaron","doi":"arxiv-2404.01519","DOIUrl":"https://doi.org/arxiv-2404.01519","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-04-01"}
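The process of dividing a sequence into sub-sequences of length k, as described in the k-mer guide above, can be sketched in a few lines. This is a minimal illustration, not code from the review: the function name and the toy sequence are hypothetical, and real tools typically also collapse each k-mer with its reverse complement, which is omitted here.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping sub-sequence of length k in a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = count_kmers("ACGTACGT", 3)
# counts: ACG -> 2, CGT -> 2, GTA -> 1, TAC -> 1
```

Applied to a full read set, the histogram of these counts (how many distinct k-mers occur once, twice, and so on) is the usual starting point for the k-mer-based assessments the review discusses.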
Kulaga Anton (Institute for Biostatistics and Informatics in Medicine and Ageing Research; Institute of Biochemistry of the Romanian Academy; International Longevity Alliance), Borysova Olga (International Longevity Alliance; CellFabrik SRL), Karmazin Alexey (International Longevity Alliance; MitoSpace), Koval Maria (Institute of Biochemistry of the Romanian Academy; International Longevity Alliance), Usanov Nikolay (Institute of Biochemistry of the Romanian Academy; International Longevity Alliance), Fedorova Alina (Institute of Biochemistry of the Romanian Academy), Evfratov Sergey (Institute of Biochemistry of the Romanian Academy), Pushkareva Malvina (Institute of Biochemistry of the Romanian Academy), Ryangguk Kim (Oak Bioinformatics LLC), Tacutu Robi (SecvADN SRL)
Genomic data has become increasingly accessible to the general public with the advent of companies offering whole genome sequencing at a relatively low cost. However, their reports are not verifiable due to a lack of crucial details and transparency: polygenic risk scores do not always mention all the polymorphisms involved. At the same time, manually investigating and interpreting the data is challenging for individuals without a background in genetics. Currently, there is no open-source or commercial solution that provides comprehensive longevity reports covering more than a limited number of polymorphisms. Additionally, there are no ready-made, out-of-the-box solutions that require minimal expertise to generate reports independently. To address these issues, we have developed the Just-DNA-Seq open-source genomic platform. Just-DNA-Seq aims to provide a user-friendly solution to genome annotation by allowing users to upload their own VCF files and receive annotations of their genetic variants and polygenic risk scores related to longevity. We also created GeneticsGenie, a custom GPT that can answer genetics questions based on our modules. With the Just-DNA-Seq platform, we want to provide full information regarding the genetics of long life: disease-predisposing variants that can reduce lifespan and manifest at different ages (cardiovascular, oncological, neurodegenerative diseases, etc.), pro-longevity variants, and longevity drug pharmacokinetics. In this research article, we discuss the features and capabilities of Just-DNA-Seq and how it can benefit individuals looking to understand and improve their health. It is crucial to note that the Just-DNA-Seq platform is exclusively intended for scientific and informational purposes and is not suitable for medical applications.
{"title":"Just-DNA-Seq, open-source personal genomics platform: longevity science for everyone","authors":"Kulaga Anton, Borysova Olga, Karmazin Alexey, Koval Maria, Usanov Nikolay, Fedorova Alina, Evfratov Sergey, Pushkareva Malvina, Ryangguk Kim, Tacutu Robi","doi":"arxiv-2403.19087","DOIUrl":"https://doi.org/arxiv-2403.19087","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-03-28"}
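The transparency concern raised in the Just-DNA-Seq abstract, that polygenic risk scores do not always list every polymorphism involved, can be made concrete with a small sketch. A polygenic score is commonly computed as a weighted sum of allele dosages; the code below assumes that form and reports which variants contributed and which were absent from the genotype. It is not the Just-DNA-Seq implementation, and the variant identifiers, weights, and function name are all hypothetical.

```python
def polygenic_score(weights, dosages):
    """Weighted sum of allele dosages, plus the variants used and those missing.

    weights: effect size per variant ID (hypothetical values).
    dosages: number of effect alleles (0, 1, or 2) per variant ID found in the genotype.
    """
    used, missing = [], []
    score = 0.0
    for variant, weight in weights.items():
        if variant in dosages:
            score += weight * dosages[variant]
            used.append(variant)
        else:
            # Report absent variants instead of silently treating them as zero.
            missing.append(variant)
    return score, used, missing

# Placeholder variant IDs and effect sizes, for illustration only.
weights = {"rsA": 0.2, "rsB": -0.1, "rsC": 0.5}
dosages = {"rsA": 2, "rsB": 1}
score, used, missing = polygenic_score(weights, dosages)
```

Returning the used and missing lists alongside the score is the design choice that makes a report checkable: a reader can see exactly which polymorphisms the number rests on.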
Annotating the structure of protein-coding genes represents a major challenge in the analysis of eukaryotic genomes. This task sets the groundwork for subsequent genomic studies aimed at understanding the functions of individual genes. BRAKER and Galba are two fully automated and containerized pipelines designed to perform accurate genome annotation. BRAKER integrates the GeneMark-ETP and AUGUSTUS gene finders, employing the TSEBRA combiner to attain high sensitivity and precision. BRAKER is adept at handling genomes of any size, provided that it has access to both transcript expression sequencing data and an extensive protein database from the target clade. In particular, BRAKER demonstrates high accuracy even with only one type of these extrinsic evidence sources, although it should be noted that accuracy diminishes for larger genomes under such conditions. In contrast, Galba adopts a distinct methodology utilizing the outcomes of direct protein-to-genome spliced alignments using miniprot to generate training genes and evidence for gene prediction in AUGUSTUS. Galba has superior accuracy in large genomes if protein sequences are the only source of evidence. This chapter provides practical guidelines for employing both pipelines in the annotation of eukaryotic genomes, with a focus on insect genomes.
{"title":"Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA","authors":"Tomáš Brůna, Lars Gabriel, Katharina J. Hoff","doi":"arxiv-2403.19416","DOIUrl":"https://doi.org/arxiv-2403.19416","url":null,"abstract":"Annotating the structure of protein-coding genes represents a major challenge in the analysis of eukaryotic genomes. This task sets the groundwork for subsequent genomic studies aimed at understanding the functions of individual genes. BRAKER and Galba are two fully automated and containerized pipelines designed to perform accurate genome annotation. BRAKER integrates the GeneMark-ETP and AUGUSTUS gene finders, employing the TSEBRA combiner to attain high sensitivity and precision. BRAKER is adept at handling genomes of any size, provided that it has access to both transcript expression sequencing data and an extensive protein database from the target clade. In particular, BRAKER demonstrates high accuracy even with only one type of these extrinsic evidence sources, although it should be noted that accuracy diminishes for larger genomes under such conditions. In contrast, Galba adopts a distinct methodology utilizing the outcomes of direct protein-to-genome spliced alignments using miniprot to generate training genes and evidence for gene prediction in AUGUSTUS. Galba has superior accuracy in large genomes if protein sequences are the only source of evidence. This chapter provides practical guidelines for employing both pipelines in the annotation of eukaryotic genomes, with a focus on insect genomes.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140324767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
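The choice between the two pipelines described in the BRAKER/Galba abstract can be sketched as a small decision function. This is only an illustration of the guidance in the abstract, not code from the chapter: the function name `choose_pipeline` and the boolean `large_genome` are hypothetical (the abstract gives no size threshold), and the flags `--genome`, `--prot_seq`, and `--bam` are assumed from the tools' public READMEs and should be checked against the installed versions.

```python
def choose_pipeline(genome, proteins=None, rnaseq_bam=None, large_genome=False):
    """Return a command line (as a list) for BRAKER or Galba.

    Hypothetical helper: the caller decides what counts as a "large" genome,
    because the abstract does not state a threshold.
    """
    if proteins is None and rnaseq_bam is None:
        raise ValueError("at least one extrinsic evidence source is required")

    # Proteins are the only evidence and the genome is large:
    # the abstract reports superior accuracy for Galba in this case.
    if proteins is not None and rnaseq_bam is None and large_genome:
        return ["galba.pl", f"--genome={genome}", f"--prot_seq={proteins}"]

    # Otherwise use BRAKER with whatever evidence is available
    # (flag names are assumptions, see the lead-in).
    cmd = ["braker.pl", f"--genome={genome}"]
    if rnaseq_bam is not None:
        cmd.append(f"--bam={rnaseq_bam}")
    if proteins is not None:
        cmd.append(f"--prot_seq={proteins}")
    return cmd


if __name__ == "__main__":
    # File names are placeholders.
    print(" ".join(choose_pipeline("genome.fa", proteins="proteins.fa",
                                   large_genome=True)))
```

The function only builds the command; running it (for example through `subprocess.run`) is left to the caller so that the selection logic can be inspected without either pipeline installed.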
In this investigation, five separate experiments were carried out. The first experiment examined the molecular characteristics of 59 barley accessions collected from different regions of Iraq using three types of molecular markers (ISSR, CDDP, and SCoT). A total of 391 polymorphic bands were amplified using forty-four ISSR, nine CDDP, and twelve SCoT primers, which yielded 255, 35, and 101 polymorphic bands, respectively. The mean PIC values for the ISSR, CDDP, and SCoT markers were 0.74, 0.63, and 0.80, respectively, indicating that these markers are efficient at detecting polymorphism among the studied barley accessions. Based on these markers, the barley accessions were clustered into two main groups using UPGMA and population structure analysis. The cluster analyses showed that the variation patterns corresponded with the geographical distribution of the barley accessions.
{"title":"Genetic diversity of barley accessions and their response under abiotic stresses using different approaches","authors":"Djshwar Dhahir Lateef, Nawroz Abdul-razzak Tahir","doi":"arxiv-2403.14181","DOIUrl":"https://doi.org/arxiv-2403.14181","url":null,"abstract":"In this investigation, five separate experiments were carried out. The first experiments were examined the molecular characteristics of 59 barley accessions collected from different regions in Iraq using three different molecular markers (ISSR, CDDP, and Scot). A total of 391 amplified polymorphic bands were generated using forty-four ISSR, nine CDDP, and twelve Scot primers, which they totally observed 255, 35, and 101 polymorphic bands respectively. The mean values of PIC for ISSR, CDDP, and Scot markers were 0.74, 0.63, and 0.80, respectively, indicating the efficiency of the underlying markers in detecting polymorphic status among the studied barley accessions. Based on the respective markers, the barley accessions were classified and clustered into two main groups using the UPGMA and population structure analysis. Results of claustral analyses showed that the variation patterns corresponded with the geographical distribution of barley accessions.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140199319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
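The barley abstract reports mean PIC (polymorphism information content) values without saying how they were computed. As a hedged illustration, the sketch below implements one widely used definition, the Botstein et al. formula PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j², where p_i are allele (or band) frequencies. Whether the study used this form or a simplified variant is an assumption; the function name `pic` is hypothetical.

```python
def pic(frequencies):
    """Polymorphism information content for one marker (Botstein-style formula).

    `frequencies` are allele (or band) frequencies and must sum to 1.
    This is an assumed definition; the abstract does not state its formula.
    """
    if abs(sum(frequencies) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")

    # Sum of squared frequencies (expected homozygosity).
    homozygosity = sum(p * p for p in frequencies)

    # Correction term over all unordered pairs i < j.
    correction = 0.0
    for i in range(len(frequencies)):
        for j in range(i + 1, len(frequencies)):
            correction += 2.0 * frequencies[i] ** 2 * frequencies[j] ** 2

    return 1.0 - homozygosity - correction


if __name__ == "__main__":
    # Two equally frequent alleles: 1 - 0.5 - 0.125 = 0.375
    print(pic([0.5, 0.5]))
```

Under this definition a marker with a single allele scores 0, and the value rises as more alleles occur at similar frequencies.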
Hongxiao Wang, Yang Yang, Zhuo Zhao, Pengfei Gu, Nishchal Sapkota, Danny Z. Chen
For predicting cancer survival outcomes, standard approaches in clinical research are often based on two main modalities: pathology images for observing cell morphology features, and genomic data (e.g., bulk RNA-seq) for quantifying gene expression. However, existing pathology-genomic multi-modal algorithms face significant challenges: (1) valuable biological insights regarding genes and gene-gene interactions are frequently overlooked; (2) one modality often dominates the optimization process, causing inadequate training for the other modality. In this paper, we introduce a new multi-modal "Path-GPTOmic" framework for cancer survival outcome prediction. First, to extract valuable biological insights, we regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data, making it adaptable for bulk RNA-seq data. Second, to address the imbalance-between-modalities problem, we propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction. The contributions of the modalities are dynamically monitored and adjusted during the training process, so that both modalities are sufficiently trained. Evaluated on two TCGA (The Cancer Genome Atlas) datasets, our model achieves substantially improved survival prediction accuracy.
{"title":"Path-GPTOmic: A Balanced Multi-modal Learning Framework for Survival Outcome Prediction","authors":"Hongxiao Wang, Yang Yang, Zhuo Zhao, Pengfei Gu, Nishchal Sapkota, Danny Z. Chen","doi":"arxiv-2403.11375","DOIUrl":"https://doi.org/arxiv-2403.11375","url":null,"abstract":"For predicting cancer survival outcomes, standard approaches in clinical research are often based on two main modalities: pathology images for observing cell morphology features, and genomic (e.g., bulk RNA-seq) for quantifying gene expressions. However, existing pathology-genomic multi-modal algorithms face significant challenges: (1) Valuable biological insights regarding genes and gene-gene interactions are frequently overlooked; (2) one modality often dominates the optimization process, causing inadequate training for the other modality. In this paper, we introduce a new multi-modal ``Path-GPTOmic\" framework for cancer survival outcome prediction. First, to extract valuable biological insights, we regulate the embedding space of a foundation model, scGPT, initially trained on single-cell RNA-seq data, making it adaptable for bulk RNA-seq data. Second, to address the imbalance-between-modalities problem, we propose a gradient modulation mechanism tailored to the Cox partial likelihood loss for survival prediction. The contributions of the modalities are dynamically monitored and adjusted during the training process, encouraging that both modalities are sufficiently trained. Evaluated on two TCGA(The Cancer Genome Atlas) datasets, our model achieves substantially improved survival prediction accuracy.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140169327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
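The Path-GPTOmic abstract builds on the Cox partial likelihood loss. For readers unfamiliar with it, the sketch below shows the standard negative Cox partial log-likelihood in plain Python. It does not reproduce the paper's gradient modulation mechanism, which the abstract does not specify; the function name, the averaging over events, and the simple handling of tied times are assumptions made for illustration.

```python
import math


def cox_neg_partial_log_likelihood(risk_scores, times, events):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk_scores: model outputs (higher means higher predicted hazard)
    times:       follow-up time for each sample
    events:      1 if the event was observed, 0 if censored
    """
    n = len(risk_scores)
    total = 0.0
    n_events = 0
    for i in range(n):
        if not events[i]:
            continue  # censored samples enter only through the risk sets

        # Risk set: samples still under observation at time t_i.
        risk_set = [risk_scores[j] for j in range(n) if times[j] >= times[i]]

        # Numerically stable log-sum-exp over the risk set.
        m = max(risk_set)
        log_sum_exp = m + math.log(sum(math.exp(r - m) for r in risk_set))

        total += risk_scores[i] - log_sum_exp
        n_events += 1

    if n_events == 0:
        return 0.0
    return -total / n_events


if __name__ == "__main__":
    times = [1.0, 2.0, 3.0]
    events = [1, 1, 1]
    # Scores that rank earlier events as higher risk give the lower loss.
    print(cox_neg_partial_log_likelihood([2.0, 1.0, 0.0], times, events))
    print(cox_neg_partial_log_likelihood([0.0, 1.0, 2.0], times, events))
```

In a multi-modal setting, each modality's branch contributes to the risk score, which is where an imbalance between modalities can arise during optimization.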