Gene and RNA editing methods, technologies, and applications are emerging as innovative forms of therapy and medicine that can be implemented more efficiently than traditional pharmaceutical treatments. Current trends emphasize the urgent need for advanced methods and technologies to detect public health threats, including diseases and viral agents. Gene and RNA editing techniques enhance the ability to identify, modify, and ameliorate the effects of genetic diseases, disorders, and disabilities. Viral detection and identification methods present numerous opportunities for enabling technologies, such as CRISPR, applicable to both RNA and gene editing through the use of specific Cas proteins. This article explores the distinctions and benefits of RNA and gene editing processes, emphasizing their contributions to the future of medical treatment. CRISPR technology, particularly its adaptation via the Cas13 protein for RNA editing, is a significant advancement in gene editing. The article examines RNA and gene editing methodologies, focusing on techniques that alter and modify genetic coding. A-to-I and C-to-U editing are currently the predominant methods of RNA modification. CRISPR stands out as the most cost-effective and customizable technology for both RNA and gene editing. Unlike permanent changes induced by cutting an individual's DNA genetic code, RNA editing offers temporary modifications by altering nucleoside bases in RNA strands, which can then attach to DNA strands as temporary modifiers.
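To make the two RNA modifications named in the abstract concrete, the following is a minimal sketch of A-to-I and C-to-U editing as single-base substitutions on an RNA string. The function names (`edit_rna`, `as_read_by_ribosome`) and the use of the letter "I" for inosine are illustrative assumptions, not part of the article; the only biology relied on is that inosine is generally read as guanosine during translation.

```python
# Toy model of single-base RNA editing; names are hypothetical.
EDIT_RULES = {
    "A-to-I": ("A", "I"),  # adenosine deaminated to inosine
    "C-to-U": ("C", "U"),  # cytidine deaminated to uridine
}


def edit_rna(seq: str, position: int, kind: str) -> str:
    """Return a copy of `seq` with the base at 0-based `position` edited."""
    if kind not in EDIT_RULES:
        raise ValueError(f"unknown edit type: {kind}")
    source, target = EDIT_RULES[kind]
    if seq[position] != source:
        raise ValueError(f"expected {source} at position {position}, found {seq[position]}")
    return seq[:position] + target + seq[position + 1:]


def as_read_by_ribosome(seq: str) -> str:
    """Inosine pairs like guanosine, so it is read as G during translation."""
    return seq.replace("I", "G")


if __name__ == "__main__":
    transcript = "AUGCAA"
    print(edit_rna(transcript, 3, "C-to-U"))
    print(as_read_by_ribosome(edit_rna(transcript, 0, "A-to-I")))
```

Because the edit returns a new string and leaves the input untouched, the sketch mirrors the abstract's point that RNA editing changes the transcript rather than the underlying DNA.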
{"title":"Gene and RNA Editing: Methods, Enabling Technologies, Applications, and Future Directions","authors":"Mohammed Aledhari, Mohamed Rahouti","doi":"arxiv-2409.09057","DOIUrl":"https://doi.org/arxiv-2409.09057","url":null,"abstract":"Gene and RNA editing methods, technologies, and applications are emerging as\u0000innovative forms of therapy and medicine, offering more efficient\u0000implementation compared to traditional pharmaceutical treatments. Current\u0000trends emphasize the urgent need for advanced methods and technologies to\u0000detect public health threats, including diseases and viral agents. Gene and RNA\u0000editing techniques enhance the ability to identify, modify, and ameliorate the\u0000effects of genetic diseases, disorders, and disabilities. Viral detection and\u0000identification methods present numerous opportunities for enabling\u0000technologies, such as CRISPR, applicable to both RNA and gene editing through\u0000the use of specific Cas proteins. This article explores the distinctions and\u0000benefits of RNA and gene editing processes, emphasizing their contributions to\u0000the future of medical treatment. CRISPR technology, particularly its adaptation\u0000via the Cas13 protein for RNA editing, is a significant advancement in gene\u0000editing. The article will delve into RNA and gene editing methodologies,\u0000focusing on techniques that alter and modify genetic coding. A-to-I and C-to-U\u0000editing are currently the most predominant methods of RNA modification. CRISPR\u0000stands out as the most cost-effective and customizable technology for both RNA\u0000and gene editing. 
Unlike permanent changes induced by cutting an individual's\u0000DNA genetic code, RNA editing offers temporary modifications by altering\u0000nucleoside bases in RNA strands, which can then attach to DNA strands as\u0000temporary modifiers.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lei Huang, Lei Xiong, Na Sun, Zunpeng Liu, Ka-Chun Wong, Manolis Kellis
The rapid advancement of single-cell ATAC sequencing (scATAC-seq) technologies holds great promise for investigating the heterogeneity of epigenetic landscapes at the cellular level. The amplification process in scATAC-seq experiments often introduces noise due to dropout events, which results in extreme sparsity that hinders accurate analysis. Consequently, there is a significant demand for the generation of high-quality scATAC-seq data in silico. Furthermore, current methodologies are typically task-specific, lacking a versatile framework capable of handling multiple tasks within a single model. In this work, we propose ATAC-Diff, a versatile framework based on a latent diffusion model that is conditioned on latent auxiliary variables so that it can adapt to various tasks. ATAC-Diff is the first diffusion model for scATAC-seq data generation and analysis. It is composed of auxiliary modules that encode latent high-level variables, enabling the model to learn the semantic information needed to sample high-quality data. With a Gaussian Mixture Model (GMM) as the latent prior and an auxiliary decoder, the resulting latent variables preserve refined genomic information that is beneficial for downstream analyses. Another innovation is the incorporation of mutual information between observed and hidden variables as a regularization term to prevent the model from decoupling from the latent variables. Through extensive experiments, we demonstrate that ATAC-Diff achieves high performance in both generation and analysis tasks, outperforming state-of-the-art models.
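As an illustration of what a diffusion model does to a latent vector, here is a minimal sketch of the standard DDPM forward (noising) step, z_t = sqrt(ᾱ_t)·z_0 + sqrt(1 − ᾱ_t)·ε. The abstract does not give ATAC-Diff's noise schedule or latent dimensionality, so the linear schedule, its default endpoints, and all function names here are assumptions chosen for illustration only.

```python
import math
import random


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list:
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: list) -> list:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_latent(z0: list, t: int, alpha_bars: list, rng: random.Random) -> list:
    """Sample z_t given z_0 in closed form (forward diffusion step)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for z in z0]


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    rng = random.Random(0)
    latent = [1.0, -2.0, 0.5]  # hypothetical encoded cell
    print(noise_latent(latent, 999, alpha_bars, rng))
```

A trained model would learn to reverse this corruption; the sketch covers only the fixed forward process, which needs no learned parameters.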
{"title":"A versatile informative diffusion model for single-cell ATAC-seq data generation and analysis","authors":"Lei Huang, Lei Xiong, Na Sun, Zunpeng Liu, Ka-Chun Wong, Manolis Kellis","doi":"arxiv-2408.14801","DOIUrl":"https://doi.org/arxiv-2408.14801","url":null,"abstract":"The rapid advancement of single-cell ATAC sequencing (scATAC-seq)\u0000technologies holds great promise for investigating the heterogeneity of\u0000epigenetic landscapes at the cellular level. The amplification process in\u0000scATAC-seq experiments often introduces noise due to dropout events, which\u0000results in extreme sparsity that hinders accurate analysis. Consequently, there\u0000is a significant demand for the generation of high-quality scATAC-seq data in\u0000silico. Furthermore, current methodologies are typically task-specific, lacking\u0000a versatile framework capable of handling multiple tasks within a single model.\u0000In this work, we propose ATAC-Diff, a versatile framework, which is based on a\u0000latent diffusion model conditioned on the latent auxiliary variables to adapt\u0000for various tasks. ATAC-Diff is the first diffusion model for the scATAC-seq\u0000data generation and analysis, composed of auxiliary modules encoding the latent\u0000high-level variables to enable the model to learn the semantic information to\u0000sample high-quality data. Gaussian Mixture Model (GMM) as the latent prior and\u0000auxiliary decoder, the yield variables reserve the refined genomic information\u0000beneficial for downstream analyses. Another innovation is the incorporation of\u0000mutual information between observed and hidden variables as a regularization\u0000term to prevent the model from decoupling from latent variables. 
Through\u0000extensive experiments, we demonstrate that ATAC-Diff achieves high performance\u0000in both generation and analysis tasks, outperforming state-of-the-art models.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sai Guna Ranjan Gurazada, Hannah M. Kennedy, Richard D. Braatz, Steven J. Mehrman, Shawn W. Polson, Irene T. Rombel
Gene therapy is poised to transition from niche to mainstream medicine, with recombinant adeno-associated virus (rAAV) as the vector of choice. However, this requires robust, scalable, industrialized production to meet demand and provide affordable patient access, which has thus far failed to materialize. Closing the chasm between demand and supply requires innovation in biomanufacturing to achieve the essential step change in rAAV product yield and quality. Omics provides a rich source of mechanistic knowledge that can be applied to HEK293, the prevailing cell line for rAAV production. In this review, the findings from a growing number of disparate studies that apply genomics, epigenomics, transcriptomics, proteomics, and metabolomics to HEK293 bioproduction are explored. Learnings from CHO-Omics, application of omics approaches to improve CHO bioproduction, provide context for the potential of "HEK-Omics" as a multiomics-informed approach providing actionable mechanistic insights for improved transient and stable production of rAAV and other recombinant products in HEK293.
{"title":"HEK-Omics: The promise of omics to optimize HEK293 for recombinant adeno-associated virus (rAAV) gene therapy manufacturing","authors":"Sai Guna Ranjan Gurazada, Hannah M. Kennedy, Richard D. Braatz, Steven J. Mehrman, Shawn W. Polson, Irene T. Rombel","doi":"arxiv-2408.13374","DOIUrl":"https://doi.org/arxiv-2408.13374","url":null,"abstract":"Gene therapy is poised to transition from niche to mainstream medicine, with\u0000recombinant adeno-associated virus (rAAV) as the vector of choice. However,\u0000this requires robust, scalable, industrialized production to meet demand and\u0000provide affordable patient access, which has thus far failed to materialize.\u0000Closing the chasm between demand and supply requires innovation in\u0000biomanufacturing to achieve the essential step change in rAAV product yield and\u0000quality. Omics provides a rich source of mechanistic knowledge that can be\u0000applied to HEK293, the prevailing cell line for rAAV production. In this\u0000review, the findings from a growing number of disparate studies that apply\u0000genomics, epigenomics, transcriptomics, proteomics, and metabolomics to HEK293\u0000bioproduction are explored. 
Learnings from CHO-Omics, application of omics\u0000approaches to improve CHO bioproduction, provide context for the potential of\u0000\"HEK-Omics\" as a multiomics-informed approach providing actionable mechanistic\u0000insights for improved transient and stable production of rAAV and other\u0000recombinant products in HEK293.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"388 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jia Zhang, Li Xiao, Peng Qi, Yaling Zeng, Xumeng Chen, Duan-fang Liao, Kai Li
Hi-C sequencing is widely used for analyzing chromosomal interactions. In this study, we propose "superimposed Hi-C", which features paired EcoP15I sites in a linker to facilitate sticky-end ligation with target DNAs. Superimposed Hi-C overcomes Hi-C's technical limitations, enabling the identification of chromosomal interactions in single cells.
{"title":"Superimposed Hi-C: A Solution Proposed for Identifying Single Cell's Chromosomal Interactions","authors":"Jia Zhang, Li Xiao, Peng Qi, Yaling Zeng, Xumeng Chen, Duan-fang Liao, Kai Li","doi":"arxiv-2408.13039","DOIUrl":"https://doi.org/arxiv-2408.13039","url":null,"abstract":"Hi-C sequencing is widely used for analyzing chromosomal interactions. In\u0000this study, we propose \"superimposed Hi-C\" which features paired EcoP15I sites\u0000in a linker to facilitate sticky-end ligation with target DNAs. Superimposed\u0000Hi-C overcomes Hi-C's technical limitations, enabling the identification of\u0000single cell's chromosomal interactions.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Changes in the number of copies of certain parts of the genome, known as copy number alterations (CNAs), due to somatic mutation processes are a hallmark of many cancers. This genomic complexity is known to be associated with poorer outcomes for patients, but describing its contribution in detail has been difficult. Copy number alterations can affect large regions spanning whole chromosomes or the entire genome itself, but can also be localised to only small segments of the genome, and no methods exist that allow this multi-scale nature to be quantified. In this paper, we address this using Wave-LSTM, a signal decomposition approach designed to capture the multi-scale structure of complex whole genome copy number profiles. Wave-LSTM combines wavelet-based source separation with deep learning-based attention mechanisms. We show that Wave-LSTM can be used to derive multi-scale representations from copy number profiles, which can be used to decipher sub-clonal structures from single-cell copy number data and to improve survival prediction performance from patient tumour profiles.
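The multi-scale idea can be illustrated with the simplest wavelet, the Haar wavelet, applied to a toy copy number profile. This is a sketch under stated assumptions: the abstract does not say which wavelet family Wave-LSTM uses, the averaging (unnormalised) Haar variant is chosen here for readability, and the function names and example profile are hypothetical.

```python
def haar_step(signal: list) -> tuple:
    """One decomposition level: pairwise means (coarse) and half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal: list, levels: int) -> tuple:
    """Return the coarsest approximation and the detail coefficients per level."""
    approx, details = list(signal), []
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


if __name__ == "__main__":
    # Hypothetical profile: diploid baseline with a focal gain in bins 4 and 5.
    profile = [2, 2, 2, 2, 4, 4, 2, 2]
    coarse, details = haar_decompose(profile, 3)
    print(coarse)   # [2.5]
    print(details)  # [[0.0, 0.0, 0.0, 0.0], [0.0, 1.0], [-0.5]]
```

In this toy profile the focal gain leaves the finest level empty and shows up only at the two coarser levels, which is the kind of scale separation a wavelet representation makes explicit.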
{"title":"Wave-LSTM: Multi-scale analysis of somatic whole genome copy number profiles","authors":"Charles Gadd, Christopher Yau","doi":"arxiv-2408.12636","DOIUrl":"https://doi.org/arxiv-2408.12636","url":null,"abstract":"Changes in the number of copies of certain parts of the genome, known as copy\u0000number alterations (CNAs), due to somatic mutation processes are a hallmark of\u0000many cancers. This genomic complexity is known to be associated with poorer\u0000outcomes for patients but describing its contribution in detail has been\u0000difficult. Copy number alterations can affect large regions spanning whole\u0000chromosomes or the entire genome itself but can also be localised to only small\u0000segments of the genome and no methods exist that allow this multi-scale nature\u0000to be quantified. In this paper, we address this using Wave-LSTM, a signal\u0000decomposition approach designed to capture the multi-scale structure of complex\u0000whole genome copy number profiles. Using wavelet-based source separation in\u0000combination with deep learning-based attention mechanisms. We show that\u0000Wave-LSTM can be used to derive multi-scale representations from copy number\u0000profiles which can be used to decipher sub-clonal structures from single-cell\u0000copy number data and to improve survival prediction performance from patient\u0000tumour profiles.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"86 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single-cell analysis is an increasingly relevant approach in "omics" studies. In the last decade, it has been applied to various fields, including cancer biology, neuroscience, and, especially, developmental biology. This rise in popularity has been accompanied by the creation of modern software, the development of new pipelines, and the design of new algorithms. Many established algorithms have also been applied with varying levels of effectiveness. Currently, there is an abundance of algorithms for all steps of the general workflow. While some scientists use ready-made pipelines (such as Seurat), manual analysis is popular, too, as it allows more flexibility. Scientists who perform their own analysis face multiple options when choosing algorithms. We used two different datasets to test some of the most widely used algorithms. In this paper, we report the main differences between them, suggest a minimal number of algorithms for each step, and explain our suggestions. At certain stages, it is impossible to make a clear choice without further context. In these cases, we explore the major possibilities and make suggestions for each of them.
{"title":"Comparison of algorithms used in single-cell transcriptomic data analysis","authors":"Jafar Isbarov, Elmir Mahammadov","doi":"arxiv-2408.12031","DOIUrl":"https://doi.org/arxiv-2408.12031","url":null,"abstract":"Single-cell analysis is an increasingly relevant approach in \"omics''\u0000studies. In the last decade, it has been applied to various fields, including\u0000cancer biology, neuroscience, and, especially, developmental biology. This rise\u0000in popularity has been accompanied with creation of modern software,\u0000development of new pipelines and design of new algorithms. Many established\u0000algorithms have also been applied with varying levels of effectiveness.\u0000Currently, there is an abundance of algorithms for all steps of the general\u0000workflow. While some scientists use ready-made pipelines (such as Seurat),\u0000manual analysis is popular, too, as it allows more flexibility. Scientists who\u0000perform their own analysis face multiple options when it comes to the choice of\u0000algorithms. We have used two different datasets to test some of the most\u0000widely-used algorithms. In this paper, we are going to report the main\u0000differences between them, suggest a minimal number of algorithms for each step,\u0000and explain our suggestions. In certain stages, it is impossible to make a\u0000clear choice without further context. 
In these cases, we are going to explore\u0000the major possibilities, and make suggestions for each one of them.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The swift advancement of single-cell RNA sequencing (scRNA-seq) technologies enables the investigation of cellular-level tissue heterogeneity. Cell annotation significantly contributes to the extensive downstream analysis of scRNA-seq data. However, the analysis of scRNA-seq for biological inference presents challenges owing to its intricate and indeterminate data distribution, characterized by a substantial volume and a high frequency of dropout events. Furthermore, the quality of training samples varies greatly, and the performance of the popular scRNA-seq data clustering solution GNN could be harmed by two types of low-quality training nodes: 1) nodes on the boundary; 2) nodes that contribute little additional information to the graph. To address these problems, we propose a single-cell curriculum learning-based deep graph embedding clustering (scCLG). We first propose a Chebyshev graph convolutional autoencoder with multi-decoder (ChebAE) that combines three optimization objectives corresponding to three decoders, including topology reconstruction loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and clustering loss, to learn cell-cell topology representation. Meanwhile, we employ a selective training strategy to train the GNN based on the features and entropy of nodes, and prune the difficult nodes based on their difficulty scores to keep the graph high-quality. Empirical results on a variety of gene expression datasets show that our model outperforms state-of-the-art methods.
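Of the three objectives listed in the abstract, the ZINB loss has a closed form that is easy to write down. The sketch below computes the zero-inflated negative binomial negative log-likelihood for a single count, using the mean/dispersion parameterisation that is common in scRNA-seq models; whether scCLG uses exactly this parameterisation is an assumption, and the function names are hypothetical.

```python
import math


def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under a negative binomial with mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_nll(x: int, mu: float, theta: float, pi: float) -> float:
    """Negative log-likelihood of x under ZINB; pi in [0, 1) is the dropout (extra-zero) probability."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        prob = pi + (1.0 - pi) * nb  # a zero is either a dropout or a genuine NB zero
    else:
        prob = (1.0 - pi) * nb       # non-zero counts can only come from the NB part
    return -math.log(prob)


if __name__ == "__main__":
    # A zero count is cheaper to explain once dropout is modelled explicitly.
    print(zinb_nll(0, mu=3.0, theta=2.0, pi=0.0))
    print(zinb_nll(0, mu=3.0, theta=2.0, pi=0.3))
```

In a model like the one described, `mu`, `theta`, and `pi` would be produced per gene and per cell by a decoder, and the loss would be the mean of this quantity over the count matrix.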
{"title":"Single-cell Curriculum Learning-based Deep Graph Embedding Clustering","authors":"Huifa Li, Jie Fu, Xinpeng Ling, Zhiyu Sun, Kuncan Wang, Zhili Chen","doi":"arxiv-2408.10511","DOIUrl":"https://doi.org/arxiv-2408.10511","url":null,"abstract":"The swift advancement of single-cell RNA sequencing (scRNA-seq) technologies\u0000enables the investigation of cellular-level tissue heterogeneity. Cell\u0000annotation significantly contributes to the extensive downstream analysis of\u0000scRNA-seq data. However, The analysis of scRNA-seq for biological inference\u0000presents challenges owing to its intricate and indeterminate data distribution,\u0000characterized by a substantial volume and a high frequency of dropout events.\u0000Furthermore, the quality of training samples varies greatly, and the\u0000performance of the popular scRNA-seq data clustering solution GNN could be\u0000harmed by two types of low-quality training nodes: 1) nodes on the boundary; 2)\u0000nodes that contribute little additional information to the graph. To address\u0000these problems, we propose a single-cell curriculum learning-based deep graph\u0000embedding clustering (scCLG). We first propose a Chebyshev graph convolutional\u0000autoencoder with multi-decoder (ChebAE) that combines three optimization\u0000objectives corresponding to three decoders, including topology reconstruction\u0000loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and\u0000clustering loss, to learn cell-cell topology representation. Meanwhile, we\u0000employ a selective training strategy to train GNN based on the features and\u0000entropy of nodes and prune the difficult nodes based on the difficulty scores\u0000to keep the high-quality graph. 
Empirical results on a variety of gene\u0000expression datasets show that our model outperforms state-of-the-art methods.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arya Hadizadeh Moghaddam, Mohsen Nayebi Kerdabadi, Cuncong Zhong, Zijun Yao
Gene expression profiles obtained through DNA microarray have proven successful in providing critical information for cancer detection classifiers. However, the limited number of samples in these datasets makes it challenging to employ complex methodologies such as deep neural networks for sophisticated analysis. To address this "small data" dilemma, meta-learning has been introduced as a solution that enhances the optimization of machine learning models by utilizing similar datasets, thereby facilitating quicker adaptation to target datasets without requiring large numbers of samples. In this study, we present a meta-learning-based approach for predicting lung cancer from gene expression profiles. We apply this framework to well-established deep learning methodologies and employ four distinct datasets for the meta-learning tasks, with one serving as the target dataset and the rest as source datasets. Our approach is evaluated against both traditional and deep learning methodologies, and the results show the superior performance of meta-learning on augmented source data compared to the baselines trained on single datasets. Moreover, we conduct a comparative analysis between meta-learning and transfer learning methodologies to highlight the efficiency of the proposed approach in addressing the challenges associated with limited sample sizes. Finally, we incorporate an explainability study to illustrate the distinctiveness of decisions made by meta-learning.
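The source/target setup described above can be sketched with a toy example. The abstract does not name the meta-learning algorithm, so this sketch substitutes Reptile, a simple first-order method, and replaces gene expression classifiers with a one-parameter regression model y = w·x. The tasks, slopes, learning rates, and function names are all hypothetical and chosen only so the behaviour is easy to follow.

```python
import random


def sgd_adapt(w: float, data: list, lr: float = 0.1, steps: int = 5) -> float:
    """Inner loop: a few gradient steps on mean squared error for one task."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def reptile(source_tasks: list, meta_lr: float = 0.5, rounds: int = 100, seed: int = 0) -> float:
    """Outer loop: move the shared initialisation toward each task's adapted weight."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(rounds):
        task = rng.choice(source_tasks)
        adapted = sgd_adapt(w, task)
        w += meta_lr * (adapted - w)
    return w


def make_task(slope: float, xs: list) -> list:
    return [(x, slope * x) for x in xs]


if __name__ == "__main__":
    # Three source datasets with related slopes; one small target dataset.
    sources = [make_task(s, [0.5, 1.0, 1.5]) for s in (1.8, 2.0, 2.2)]
    target = make_task(2.1, [1.0, 2.0])
    w_meta = reptile(sources)
    print(abs(sgd_adapt(w_meta, target, steps=1) - 2.1))  # error from meta-learned start
    print(abs(sgd_adapt(0.0, target, steps=1) - 2.1))     # error from a cold start
```

After a single gradient step on the two target samples, the meta-learned initialisation ends up closer to the target slope than the cold start does, which is the "quicker adaptation with few samples" effect in miniature.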
{"title":"Meta-Learning on Augmented Gene Expression Profiles for Enhanced Lung Cancer Detection","authors":"Arya Hadizadeh Moghaddam, Mohsen Nayebi Kerdabadi, Cuncong Zhong, Zijun Yao","doi":"arxiv-2408.09635","DOIUrl":"https://doi.org/arxiv-2408.09635","url":null,"abstract":"Gene expression profiles obtained through DNA microarray have proven\u0000successful in providing critical information for cancer detection classifiers.\u0000However, the limited number of samples in these datasets poses a challenge to\u0000employ complex methodologies such as deep neural networks for sophisticated\u0000analysis. To address this \"small data\" dilemma, Meta-Learning has been\u0000introduced as a solution to enhance the optimization of machine learning models\u0000by utilizing similar datasets, thereby facilitating a quicker adaptation to\u0000target datasets without the requirement of sufficient samples. In this study,\u0000we present a meta-learning-based approach for predicting lung cancer from gene\u0000expression profiles. We apply this framework to well-established deep learning\u0000methodologies and employ four distinct datasets for the meta-learning tasks,\u0000where one as the target dataset and the rest as source datasets. Our approach\u0000is evaluated against both traditional and deep learning methodologies, and the\u0000results show the superior performance of meta-learning on augmented source data\u0000compared to the baselines trained on single datasets. Moreover, we conduct the\u0000comparative analysis between meta-learning and transfer learning methodologies\u0000to highlight the efficiency of the proposed approach in addressing the\u0000challenges associated with limited sample sizes. 
Finally, we incorporate the\u0000explainability study to illustrate the distinctiveness of decisions made by\u0000meta-learning.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Selim Romero, Shreyan Gupta, Victoria Gatlin, Robert S. Chapkin, James J. Cai
Feature selection is vital for identifying relevant variables in classification and regression models, especially in single-cell RNA sequencing (scRNA-seq) data analysis. Traditional methods like LASSO often struggle with the nonlinearities and multicollinearities in scRNA-seq data due to complex gene expression and extensive gene interactions. Quantum annealing, a form of quantum computing, offers a promising solution. In this study, we apply quantum annealing-empowered quadratic unconstrained binary optimization (QUBO) for feature selection in scRNA-seq data. Using data from a human cell differentiation system, we show that QUBO identifies genes with nonlinear expression patterns related to differentiation time, many of which play roles in the differentiation process. In contrast, LASSO tends to select genes with more linear expression changes. Our findings suggest that the QUBO method, powered by quantum annealing, can reveal complex gene expression patterns that traditional methods might overlook, enhancing scRNA-seq data analysis and interpretation.
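To show what a QUBO formulation of feature selection looks like, the sketch below builds a small QUBO matrix and minimises it by exhaustive search. Two assumptions should be stated plainly: the relevance-minus-redundancy objective is a common way to pose feature selection as a QUBO and may differ from the paper's formulation, and brute-force enumeration stands in for the quantum annealer, which is only practical here because the example has three genes. All names and numbers are hypothetical.

```python
from itertools import product


def build_qubo(relevance: list, redundancy: list, penalty: float = 1.0) -> list:
    """Upper-triangular Q: reward relevant genes on the diagonal, penalise redundant pairs off it."""
    n = len(relevance)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -relevance[i]
        for j in range(i + 1, n):
            Q[i][j] = penalty * redundancy[i][j]
    return Q


def qubo_energy(Q: list, x: tuple) -> float:
    """Energy x^T Q x for a binary selection vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def solve_brute_force(Q: list) -> tuple:
    """Enumerate all 2^n selections; a classical stand-in for an annealer on tiny problems."""
    return min(product((0, 1), repeat=len(Q)), key=lambda x: qubo_energy(Q, x))


if __name__ == "__main__":
    relevance = [0.9, 0.8, 0.3]      # hypothetical gene-to-outcome scores
    redundancy = [                    # hypothetical gene-to-gene similarity
        [0.0, 0.95, 0.1],
        [0.95, 0.0, 0.1],
        [0.1, 0.1, 0.0],
    ]
    print(solve_brute_force(build_qubo(relevance, redundancy)))  # (1, 0, 1)
```

The second gene is individually more relevant than the third, yet the minimum-energy selection drops it because it is nearly redundant with the first; pairwise terms like this are what distinguish a QUBO objective from ranking features one at a time.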
{"title":"Quantum Annealing for Enhanced Feature Selection in Single-Cell RNA Sequencing Data Analysis","authors":"Selim Romero, Shreyan Gupta, Victoria Gatlin, Robert S. Chapkin, James J. Cai","doi":"arxiv-2408.08867","DOIUrl":"https://doi.org/arxiv-2408.08867","url":null,"abstract":"Feature selection is vital for identifying relevant variables in\u0000classification and regression models, especially in single-cell RNA sequencing\u0000(scRNA-seq) data analysis. Traditional methods like LASSO often struggle with\u0000the nonlinearities and multicollinearities in scRNA-seq data due to complex\u0000gene expression and extensive gene interactions. Quantum annealing, a form of\u0000quantum computing, offers a promising solution. In this study, we apply quantum\u0000annealing-empowered quadratic unconstrained binary optimization (QUBO) for\u0000feature selection in scRNA-seq data. Using data from a human cell\u0000differentiation system, we show that QUBO identifies genes with nonlinear\u0000expression patterns related to differentiation time, many of which play roles\u0000in the differentiation process. In contrast, LASSO tends to select genes with\u0000more linear expression changes. 
Our findings suggest that the QUBO method,\u0000powered by quantum annealing, can reveal complex gene expression patterns that\u0000traditional methods might overlook, enhancing scRNA-seq data analysis and\u0000interpretation.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The application of machine learning to transcriptomics data has led to significant advances in cancer research. However, the high dimensionality and complexity of RNA sequencing (RNA-seq) data pose significant challenges in pan-cancer studies. This study hypothesizes that gene sets derived from single-cell RNA sequencing (scRNA-seq) data will outperform those selected using bulk RNA-seq in pan-cancer downstream tasks. We analyzed scRNA-seq data from 181 tumor biopsies across 13 cancer types. High-dimensional weighted gene co-expression network analysis (hdWGCNA) was performed to identify relevant gene sets, which were further refined using XGBoost for feature selection. These gene sets were applied to downstream tasks using TCGA pan-cancer RNA-seq data and compared with six reference gene sets and oncogenes from OncoKB, with all gene sets evaluated using deep learning models, including multilayer perceptrons (MLPs) and graph neural networks (GNNs). The XGBoost-refined hdWGCNA gene set demonstrated higher performance in most tasks, including tumor mutation burden assessment, microsatellite instability classification, mutation prediction, cancer subtyping, and grading. In particular, genes such as DPM1, BAD, and FKBP4 emerged as important pan-cancer biomarkers, with DPM1 consistently significant across tasks. This study presents a robust approach for feature selection in cancer genomics by integrating scRNA-seq data and advanced analysis techniques, offering a promising avenue for improving predictive accuracy in cancer research.
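The co-expression step can be illustrated with the soft-threshold adjacency used in weighted gene co-expression network analysis, a_ij = |cor(x_i, x_j)|^β. This is a minimal sketch of that one formula for an unsigned network, not of the hdWGCNA package or its single-cell preprocessing; the value β = 6, the function names, and the toy expression vectors are assumptions for illustration.

```python
import math


def pearson(a: list, b: list) -> float:
    """Pearson correlation of two equal-length expression vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


def soft_threshold_adjacency(expr: list, beta: int = 6) -> list:
    """Unsigned weighted adjacency: |correlation| raised to the power beta."""
    n = len(expr)
    adj = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return adj


if __name__ == "__main__":
    # Rows are hypothetical genes, columns are cells or samples.
    expr = [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with the first gene
        [1.0, 3.0, 2.0, 4.0],   # correlation 0.8 with the first gene
    ]
    for row in soft_threshold_adjacency(expr):
        print([round(v, 4) for v in row])
```

Raising the correlation to a power suppresses weak links far more than strong ones (0.8 becomes about 0.26 at β = 6), which is what lets module detection focus on tightly co-expressed genes before any downstream feature selection.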
{"title":"Pan-cancer gene set discovery via scRNA-seq for optimal deep learning based downstream tasks","authors":"Jong Hyun Kim, Jongseong Jang","doi":"arxiv-2408.07233","DOIUrl":"https://doi.org/arxiv-2408.07233","url":null,"abstract":"The application of machine learning to transcriptomics data has led to\u0000significant advances in cancer research. However, the high dimensionality and\u0000complexity of RNA sequencing (RNA-seq) data pose significant challenges in\u0000pan-cancer studies. This study hypothesizes that gene sets derived from\u0000single-cell RNA sequencing (scRNA-seq) data will outperform those selected\u0000using bulk RNA-seq in pan-cancer downstream tasks. We analyzed scRNA-seq data\u0000from 181 tumor biopsies across 13 cancer types. High-dimensional weighted gene\u0000co-expression network analysis (hdWGCNA) was performed to identify relevant\u0000gene sets, which were further refined using XGBoost for feature selection.\u0000These gene sets were applied to downstream tasks using TCGA pan-cancer RNA-seq\u0000data and compared to six reference gene sets and oncogenes from OncoKB\u0000evaluated with deep learning models, including multilayer perceptrons (MLPs)\u0000and graph neural networks (GNNs). The XGBoost-refined hdWGCNA gene set\u0000demonstrated higher performance in most tasks, including tumor mutation burden\u0000assessment, microsatellite instability classification, mutation prediction,\u0000cancer subtyping, and grading. In particular, genes such as DPM1, BAD, and\u0000FKBP4 emerged as important pan-cancer biomarkers, with DPM1 consistently\u0000significant across tasks. 
This study presents a robust approach for feature\u0000selection in cancer genomics by integrating scRNA-seq data and advanced\u0000analysis techniques, offering a promising avenue for improving predictive\u0000accuracy in cancer research.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142180987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}