arXiv - QuanBio - Genomics: Latest Papers (Page 2)

Gene and RNA Editing: Methods, Enabling Technologies, Applications, and Future Directions

arXiv - QuanBio - Genomics

Pub Date : 2024-09-01 DOI: arxiv-2409.09057

Mohammed Aledhari, Mohamed Rahouti

Gene and RNA editing methods, technologies, and applications are emerging as innovative forms of therapy and medicine, offering more efficient implementation compared to traditional pharmaceutical treatments. Current trends emphasize the urgent need for advanced methods and technologies to detect public health threats, including diseases and viral agents. Gene and RNA editing techniques enhance the ability to identify, modify, and ameliorate the effects of genetic diseases, disorders, and disabilities. Viral detection and identification methods present numerous opportunities for enabling technologies, such as CRISPR, applicable to both RNA and gene editing through the use of specific Cas proteins. This article explores the distinctions and benefits of RNA and gene editing processes, emphasizing their contributions to the future of medical treatment. CRISPR technology, particularly its adaptation via the Cas13 protein for RNA editing, is a significant advancement in gene editing. The article will delve into RNA and gene editing methodologies, focusing on techniques that alter and modify genetic coding. A-to-I and C-to-U editing are currently the most predominant methods of RNA modification. CRISPR stands out as the most cost-effective and customizable technology for both RNA and gene editing. Unlike permanent changes induced by cutting an individual's DNA genetic code, RNA editing offers temporary modifications by altering nucleoside bases in RNA strands, which can then attach to DNA strands as temporary modifiers.
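
The two RNA base edits named in this abstract, A-to-I and C-to-U, can be pictured as single-character substitutions on an RNA string. The sketch below is only an illustration under that assumption: the function names `edit_rna` and `as_read_by_ribosome` are hypothetical and do not come from the paper. It relies on one well-established fact, that inosine (written `I` here) is read as guanosine during translation.

```python
# Hypothetical helpers for illustration; not an API from the paper.
EDITS = {"A-to-I": ("A", "I"), "C-to-U": ("C", "U")}


def edit_rna(rna, position, kind):
    """Apply one base edit at a 0-based position of an RNA string."""
    source, target = EDITS[kind]
    if rna[position] != source:
        raise ValueError(f"{kind} needs {source} at {position}, found {rna[position]}")
    return rna[:position] + target + rna[position + 1:]


def as_read_by_ribosome(rna):
    """Inosine pairs like guanosine, so translation reads I as G."""
    return rna.replace("I", "G")


edited = edit_rna("AUGCAG", 4, "A-to-I")
print(edited, as_read_by_ribosome(edited))
```

The genomic DNA is never touched in this picture; only the transcript string changes, which is the sense in which the abstract calls RNA edits temporary.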

Citations: 0

A versatile informative diffusion model for single-cell ATAC-seq data generation and analysis

arXiv - QuanBio - Genomics

Pub Date : 2024-08-27 DOI: arxiv-2408.14801

Lei Huang, Lei Xiong, Na Sun, Zunpeng Liu, Ka-Chun Wong, Manolis Kellis

The rapid advancement of single-cell ATAC sequencing (scATAC-seq) technologies holds great promise for investigating the heterogeneity of epigenetic landscapes at the cellular level. The amplification process in scATAC-seq experiments often introduces noise due to dropout events, which results in extreme sparsity that hinders accurate analysis. Consequently, there is a significant demand for the generation of high-quality scATAC-seq data in silico. Furthermore, current methodologies are typically task-specific, lacking a versatile framework capable of handling multiple tasks within a single model. In this work, we propose ATAC-Diff, a versatile framework based on a latent diffusion model conditioned on latent auxiliary variables to adapt to various tasks. ATAC-Diff is the first diffusion model for scATAC-seq data generation and analysis, composed of auxiliary modules encoding the latent high-level variables to enable the model to learn the semantic information needed to sample high-quality data. With a Gaussian Mixture Model (GMM) as the latent prior and an auxiliary decoder, the resulting variables retain refined genomic information beneficial for downstream analyses. Another innovation is the incorporation of mutual information between observed and hidden variables as a regularization term to prevent the model from decoupling from latent variables. Through extensive experiments, we demonstrate that ATAC-Diff achieves high performance in both generation and analysis tasks, outperforming state-of-the-art models.
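
The abstract does not spell out the noise process ATAC-Diff uses, so the sketch below shows only the generic forward (noising) step of a standard denoising diffusion model applied to a latent vector: z_t = sqrt(ᾱ_t) · z_0 + sqrt(1 − ᾱ_t) · ε with ε drawn from a standard normal. The linear β schedule, its endpoints, and all function names are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def linear_betas(steps, start=1e-4, end=0.02):
    """Assumed linear noise schedule; the paper's schedule is not stated."""
    if steps == 1:
        return [start]
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) up to and including step t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def noise_latent(z0, betas, t, rng):
    """Sample z_t given z_0 in closed form (standard DDPM forward process)."""
    ab = alpha_bar(betas, t)
    return [math.sqrt(ab) * z + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for z in z0]


betas = linear_betas(1000)
rng = random.Random(0)
z0 = [0.5, -1.2, 0.3]  # stand-in for an encoded cell
print(noise_latent(z0, betas, 10, rng))   # still close to z0
print(noise_latent(z0, betas, 999, rng))  # essentially pure noise
```

A trained model would learn to reverse this process; conditioning on auxiliary latent variables, as the abstract describes, would enter in that reverse step, which is not sketched here.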

Citations: 0

HEK-Omics: The promise of omics to optimize HEK293 for recombinant adeno-associated virus (rAAV) gene therapy manufacturing

arXiv - QuanBio - Genomics

Pub Date : 2024-08-23 DOI: arxiv-2408.13374

Sai Guna Ranjan Gurazada, Hannah M. Kennedy, Richard D. Braatz, Steven J. Mehrman, Shawn W. Polson, Irene T. Rombel

Gene therapy is poised to transition from niche to mainstream medicine, with recombinant adeno-associated virus (rAAV) as the vector of choice. However, this requires robust, scalable, industrialized production to meet demand and provide affordable patient access, which has thus far failed to materialize. Closing the chasm between demand and supply requires innovation in biomanufacturing to achieve the essential step change in rAAV product yield and quality. Omics provides a rich source of mechanistic knowledge that can be applied to HEK293, the prevailing cell line for rAAV production. In this review, the findings from a growing number of disparate studies that apply genomics, epigenomics, transcriptomics, proteomics, and metabolomics to HEK293 bioproduction are explored. Learnings from CHO-Omics, the application of omics approaches to improve CHO bioproduction, provide context for the potential of "HEK-Omics" as a multiomics-informed approach providing actionable mechanistic insights for improved transient and stable production of rAAV and other recombinant products in HEK293.

Citations: 0

Superimposed Hi-C: A Solution Proposed for Identifying Single Cell's Chromosomal Interactions

arXiv - QuanBio - Genomics

Pub Date : 2024-08-23 DOI: arxiv-2408.13039

Jia Zhang, Li Xiao, Peng Qi, Yaling Zeng, Xumeng Chen, Duan-fang Liao, Kai Li

Hi-C sequencing is widely used for analyzing chromosomal interactions. In this study, we propose "superimposed Hi-C", which features paired EcoP15I sites in a linker to facilitate sticky-end ligation with target DNAs. Superimposed Hi-C overcomes Hi-C's technical limitations, enabling the identification of a single cell's chromosomal interactions.

Citations: 0

Wave-LSTM: Multi-scale analysis of somatic whole genome copy number profiles

arXiv - QuanBio - Genomics

Pub Date : 2024-08-22 DOI: arxiv-2408.12636

Charles Gadd, Christopher Yau

Changes in the number of copies of certain parts of the genome, known as copy number alterations (CNAs), due to somatic mutation processes are a hallmark of many cancers. This genomic complexity is known to be associated with poorer outcomes for patients, but describing its contribution in detail has been difficult. Copy number alterations can affect large regions spanning whole chromosomes or the entire genome itself but can also be localised to only small segments of the genome, and no methods exist that allow this multi-scale nature to be quantified. In this paper, we address this using Wave-LSTM, a signal decomposition approach designed to capture the multi-scale structure of complex whole genome copy number profiles. Using wavelet-based source separation in combination with deep learning-based attention mechanisms, we show that Wave-LSTM can be used to derive multi-scale representations from copy number profiles, which can be used to decipher sub-clonal structures from single-cell copy number data and to improve survival prediction performance from patient tumour profiles.
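
To make the idea of a multi-scale view of a copy number profile concrete, the sketch below applies a plain Haar wavelet decomposition (averaging form) to a toy profile. It is an assumption-laden illustration, not Wave-LSTM itself: the abstract does not say which wavelet family is used, and the attention and LSTM components are omitted entirely. The function names and the toy profile are hypothetical.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse) and half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal):
    """Return the overall mean and detail coefficients from finest to coarsest scale."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx, details = list(signal), []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx[0], details


# Toy profile: diploid baseline with a two-bin gain to four copies.
profile = [2, 2, 2, 2, 4, 4, 2, 2]
mean, details = haar_decompose(profile)
print(mean)     # 2.5
print(details)  # [[0.0, 0.0, 0.0, 0.0], [0.0, 1.0], [-0.5]]
```

The gain leaves the finest scale silent and shows up only at the two coarser scales, which is the kind of scale-specific signal a downstream model could attend to.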

Citations: 0

Comparison of algorithms used in single-cell transcriptomic data analysis

arXiv - QuanBio - Genomics

Pub Date : 2024-08-21 DOI: arxiv-2408.12031

Jafar Isbarov, Elmir Mahammadov

Single-cell analysis is an increasingly relevant approach in "omics" studies. In the last decade, it has been applied to various fields, including cancer biology, neuroscience, and, especially, developmental biology. This rise in popularity has been accompanied by the creation of modern software, the development of new pipelines, and the design of new algorithms. Many established algorithms have also been applied with varying levels of effectiveness. Currently, there is an abundance of algorithms for all steps of the general workflow. While some scientists use ready-made pipelines (such as Seurat), manual analysis is popular, too, as it allows more flexibility. Scientists who perform their own analysis face multiple options when it comes to the choice of algorithms. We have used two different datasets to test some of the most widely used algorithms. In this paper, we report the main differences between them, suggest a minimal number of algorithms for each step, and explain our suggestions. In certain stages, it is impossible to make a clear choice without further context. In these cases, we explore the major possibilities and make suggestions for each one of them.

Citations: 0

Single-cell Curriculum Learning-based Deep Graph Embedding Clustering

arXiv - QuanBio - Genomics

Pub Date : 2024-08-20 DOI: arxiv-2408.10511

Huifa Li, Jie Fu, Xinpeng Ling, Zhiyu Sun, Kuncan Wang, Zhili Chen

The swift advancement of single-cell RNA sequencing (scRNA-seq) technologies enables the investigation of cellular-level tissue heterogeneity. Cell annotation significantly contributes to the extensive downstream analysis of scRNA-seq data. However, the analysis of scRNA-seq for biological inference presents challenges owing to its intricate and indeterminate data distribution, characterized by a substantial volume and a high frequency of dropout events. Furthermore, the quality of training samples varies greatly, and the performance of the popular scRNA-seq data clustering solution GNN could be harmed by two types of low-quality training nodes: 1) nodes on the boundary; 2) nodes that contribute little additional information to the graph. To address these problems, we propose a single-cell curriculum learning-based deep graph embedding clustering (scCLG). We first propose a Chebyshev graph convolutional autoencoder with multi-decoder (ChebAE) that combines three optimization objectives corresponding to three decoders, including topology reconstruction loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and clustering loss, to learn cell-cell topology representation. Meanwhile, we employ a selective training strategy to train GNN based on the features and entropy of nodes and prune the difficult nodes based on the difficulty scores to keep the high-quality graph. Empirical results on a variety of gene expression datasets show that our model outperforms state-of-the-art methods.
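
Of the three objectives listed above, the zero-inflated negative binomial (ZINB) loss has a compact closed form, so a minimal per-count sketch may help. It assumes the common parameterization by mean `mu`, dispersion `theta`, and dropout probability `pi`; the abstract does not state which parameterization scCLG uses, and the function names are hypothetical.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial (mean mu, dispersion theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of one count under ZINB with dropout probability pi."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    # A zero can come from dropout (pi) or from the NB itself; nonzeros only from the NB.
    prob = pi + (1.0 - pi) * nb if x == 0 else (1.0 - pi) * nb
    return -math.log(prob)


# Zeros are cheaper to explain when dropout is allowed.
print(zinb_nll(0, mu=2.0, theta=1.5, pi=0.3))
print(zinb_nll(0, mu=2.0, theta=1.5, pi=0.0))
```

In a model such as the one described, a decoder would output `mu`, `theta`, and `pi` per gene and cell, and this quantity would be averaged over the count matrix; that wiring is left out here.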

Citations: 0

Meta-Learning on Augmented Gene Expression Profiles for Enhanced Lung Cancer Detection

arXiv - QuanBio - Genomics

Pub Date : 2024-08-19 DOI: arxiv-2408.09635

Arya Hadizadeh Moghaddam, Mohsen Nayebi Kerdabadi, Cuncong Zhong, Zijun Yao

Gene expression profiles obtained through DNA microarray have proven successful in providing critical information for cancer detection classifiers. However, the limited number of samples in these datasets poses a challenge to employing complex methodologies such as deep neural networks for sophisticated analysis. To address this "small data" dilemma, meta-learning has been introduced as a solution to enhance the optimization of machine learning models by utilizing similar datasets, thereby facilitating a quicker adaptation to target datasets without the requirement of sufficient samples. In this study, we present a meta-learning-based approach for predicting lung cancer from gene expression profiles. We apply this framework to well-established deep learning methodologies and employ four distinct datasets for the meta-learning tasks, where one serves as the target dataset and the rest as source datasets. Our approach is evaluated against both traditional and deep learning methodologies, and the results show the superior performance of meta-learning on augmented source data compared to the baselines trained on single datasets. Moreover, we conduct a comparative analysis between meta-learning and transfer learning methodologies to highlight the efficiency of the proposed approach in addressing the challenges associated with limited sample sizes. Finally, we incorporate an explainability study to illustrate the distinctiveness of decisions made by meta-learning.
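
The source-then-target idea in this abstract can be sketched with a deliberately tiny example. The abstract does not name the meta-learning algorithm, so the code below uses Reptile, a simple first-order method, as a stand-in, and replaces gene expression classifiers with one-parameter linear regression tasks. Every name, dataset, and hyperparameter here is an assumption for illustration only.

```python
def sgd_steps(w, data, lr, steps):
    """A few gradient steps on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def reptile(tasks, w0, inner_lr, meta_lr, inner_steps, epochs):
    """Move the shared initialization toward each task's adapted weights."""
    w = w0
    for _ in range(epochs):
        for data in tasks:
            adapted = sgd_steps(w, data, inner_lr, inner_steps)
            w += meta_lr * (adapted - w)
    return w


def make_task(slope):
    return [(x, slope * x) for x in (1.0, 2.0, 3.0)]


sources = [make_task(s) for s in (1.8, 2.0, 2.2)]  # three source datasets
target = make_task(2.1)                            # small target dataset

w_meta = reptile(sources, w0=0.0, inner_lr=0.05, meta_lr=0.5, inner_steps=3, epochs=50)
print(sgd_steps(w_meta, target, 0.05, 1))  # one step from the meta-learned start
print(sgd_steps(0.0, target, 0.05, 1))     # one step from scratch
```

Starting from the meta-learned initialization, a single gradient step lands much closer to the target slope than the same step from zero, which mirrors the quicker adaptation the abstract attributes to meta-learning.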

Citations: 0

Quantum Annealing for Enhanced Feature Selection in Single-Cell RNA Sequencing Data Analysis

arXiv - QuanBio - Genomics

Pub Date : 2024-08-16 DOI: arxiv-2408.08867

Selim Romero, Shreyan Gupta, Victoria Gatlin, Robert S. Chapkin, James J. Cai

Feature selection is vital for identifying relevant variables in classification and regression models, especially in single-cell RNA sequencing (scRNA-seq) data analysis. Traditional methods like LASSO often struggle with the nonlinearities and multicollinearities in scRNA-seq data due to complex gene expression and extensive gene interactions. Quantum annealing, a form of quantum computing, offers a promising solution. In this study, we apply quantum annealing-empowered quadratic unconstrained binary optimization (QUBO) for feature selection in scRNA-seq data. Using data from a human cell differentiation system, we show that QUBO identifies genes with nonlinear expression patterns related to differentiation time, many of which play roles in the differentiation process. In contrast, LASSO tends to select genes with more linear expression changes. Our findings suggest that the QUBO method, powered by quantum annealing, can reveal complex gene expression patterns that traditional methods might overlook, enhancing scRNA-seq data analysis and interpretation.
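
A QUBO casts feature selection as choosing a binary vector x that minimizes x^T Q x. One common construction, assumed here because the abstract does not give the paper's exact objective, rewards each gene's relevance on the diagonal of Q and penalizes redundancy between pairs of genes off the diagonal. The sketch solves a three-gene toy problem by exhaustive search, a classical stand-in for the quantum annealer; the scores and function names are made up for illustration.

```python
from itertools import product


def build_qubo(relevance, redundancy, alpha):
    """Upper-triangular Q: -relevance on the diagonal, alpha * redundancy above it."""
    n = len(relevance)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -relevance[i]
        for j in range(i + 1, n):
            Q[i][j] = alpha * redundancy[i][j]
    return Q


def qubo_energy(x, Q):
    """Evaluate x^T Q x for a binary selection vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def solve_brute_force(Q):
    """Try all 2^n selections; only feasible for tiny n, unlike an annealer."""
    return min(product((0, 1), repeat=len(Q)), key=lambda x: qubo_energy(x, Q))


relevance = [0.9, 0.8, 0.3]  # made-up relevance of each gene to the target
redundancy = [
    [0.0, 0.95, 0.1],        # genes 0 and 1 carry nearly the same signal
    [0.95, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
Q = build_qubo(relevance, redundancy, alpha=1.0)
print(solve_brute_force(Q))  # (1, 0, 1)
```

The optimum keeps gene 0 and the weaker but non-redundant gene 2 while dropping gene 1, showing how the pairwise term steers selection away from correlated features.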

Citations: 0

Pan-cancer gene set discovery via scRNA-seq for optimal deep learning based downstream tasks

arXiv - QuanBio - Genomics

Pub Date : 2024-08-13 DOI: arxiv-2408.07233

Jong Hyun Kim, Jongseong Jang

The application of machine learning to transcriptomics data has led to significant advances in cancer research. However, the high dimensionality and complexity of RNA sequencing (RNA-seq) data pose significant challenges in pan-cancer studies. This study hypothesizes that gene sets derived from single-cell RNA sequencing (scRNA-seq) data will outperform those selected using bulk RNA-seq in pan-cancer downstream tasks. We analyzed scRNA-seq data from 181 tumor biopsies across 13 cancer types. High-dimensional weighted gene co-expression network analysis (hdWGCNA) was performed to identify relevant gene sets, which were further refined using XGBoost for feature selection. These gene sets were applied to downstream tasks using TCGA pan-cancer RNA-seq data and compared to six reference gene sets and oncogenes from OncoKB evaluated with deep learning models, including multilayer perceptrons (MLPs) and graph neural networks (GNNs). The XGBoost-refined hdWGCNA gene set demonstrated higher performance in most tasks, including tumor mutation burden assessment, microsatellite instability classification, mutation prediction, cancer subtyping, and grading. In particular, genes such as DPM1, BAD, and FKBP4 emerged as important pan-cancer biomarkers, with DPM1 consistently significant across tasks. This study presents a robust approach for feature selection in cancer genomics by integrating scRNA-seq data and advanced analysis techniques, offering a promising avenue for improving predictive accuracy in cancer research.
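The refinement step described in the abstract (candidate gene sets narrowed down by a feature-importance ranking) can be illustrated with a minimal sketch. This is not the authors' pipeline: absolute Pearson correlation with the label is swapped in as a simple stand-in for XGBoost importance scores, and the function names, the data layout, and the toy values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)


def refine_gene_set(expression, labels, candidates, top_k):
    """Keep the top_k candidate genes ranked by |correlation| with the labels.

    expression: dict of gene name -> per-sample values (hypothetical layout)
    candidates: genes proposed upstream, e.g. by a co-expression module
    The score is a stand-in for a model-based importance such as XGBoost gain.
    """
    scores = {g: abs(pearson(expression[g], labels)) for g in candidates}
    ranked = sorted(candidates, key=lambda g: scores[g], reverse=True)
    return ranked[:top_k], scores


# Toy data: four samples with a binary label. Gene names are placeholders
# and the values are made up; nothing here comes from the paper.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],  # tracks the label
    "GENE_B": [5.0, 5.0, 5.0, 5.0],  # constant, carries no signal
    "GENE_C": [2.0, 1.0, 2.0, 1.0],  # varies, but independently of the label
}
labels = [0, 0, 1, 1]
selected, scores = refine_gene_set(expression, labels, list(expression), top_k=2)
```

In a real workflow the correlation score would be replaced by importances from a fitted model, and the selected genes would then feed the downstream classifiers.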



Citations: 0