arXiv - QuanBio - Genomics: Latest Papers (Page 10)

stMCDI: Masked Conditional Diffusion Model with Graph Neural Network for Spatial Transcriptomics Data Imputation

arXiv - QuanBio - Genomics

Pub Date : 2024-03-16 DOI: arxiv-2403.10863

Xiaoyu Li, Wenwen Min, Shunfang Wang, Changmiao Wang, Taosheng Xu

Spatially resolved transcriptomics represents a significant advancement in single-cell analysis by offering both gene expression data and their corresponding physical locations. However, this high degree of spatial resolution entails a drawback, as the resulting spatial transcriptomic data at the cellular level is notably plagued by a high incidence of missing values. Furthermore, most existing imputation methods either overlook the spatial information between spots or compromise the overall gene expression data distribution. To address these challenges, our primary focus is on effectively utilizing the spatial location information within spatial transcriptomic data to impute missing values, while preserving the overall data distribution. We introduce stMCDI, a novel conditional diffusion model for spatial transcriptomics data imputation, which employs a denoising network trained using randomly masked data portions as guidance, with the unmasked data serving as conditions. Additionally, it utilizes a GNN encoder to integrate the spatial position information, thereby enhancing model performance. The results obtained from spatial transcriptomics datasets elucidate the performance of our methods relative to existing approaches.
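
The masking scheme mentioned in the abstract (randomly masked portions as the training target, unmasked data as the condition) can be illustrated with a minimal sketch. It assumes missing values are marked with None and that a fixed fraction of the observed entries is hidden; the function name split_observed, the mask_ratio parameter, and the toy matrix are hypothetical and are not taken from the stMCDI code. The diffusion model and the GNN encoder are not shown.

```python
import random


def split_observed(observed, mask_ratio, seed=0):
    """Split observed (spot, gene) positions into condition and target sets.

    observed   : list of (spot, gene) index pairs that carry a measured value
    mask_ratio : fraction of observed positions to hide as the training target
    """
    rng = random.Random(seed)
    n_target = round(mask_ratio * len(observed))
    target = set(rng.sample(observed, n_target))
    # Everything that was observed but not hidden stays visible as the condition.
    condition = [pos for pos in observed if pos not in target]
    return condition, sorted(target)


# Toy matrix: 3 spots x 4 genes; None marks a missing value (an assumption).
expression = [
    [1.0, None, 3.0, 0.5],
    [None, 2.0, None, 1.5],
    [0.2, 0.8, 4.0, None],
]
observed = [
    (i, j)
    for i, row in enumerate(expression)
    for j, value in enumerate(row)
    if value is not None
]
condition, target = split_observed(observed, mask_ratio=0.25)
```

A model trained this way would be asked to reconstruct the values at the target positions from the values at the condition positions, so that the truly missing entries can later be imputed the same way.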

Citations: 0

scVGAE: A Novel Approach using ZINB-Based Variational Graph Autoencoder for Single-Cell RNA-Seq Imputation

arXiv - QuanBio - Genomics

Pub Date : 2024-03-13 DOI: arxiv-2403.08959

Yoshitaka Inoue

Single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to study individual cellular distinctions and uncover unique cell characteristics. However, a significant technical challenge in scRNA-seq analysis is the occurrence of "dropout" events, where certain gene expressions cannot be detected. This issue is particularly pronounced in genes with low or sparse expression levels, impacting the precision and interpretability of the obtained data. To address this challenge, various imputation methods have been implemented to predict such missing values, aiming to enhance the analysis's accuracy and usefulness. A prevailing hypothesis posits that scRNA-seq data conforms to a zero-inflated negative binomial (ZINB) distribution. Consequently, methods have been developed to model the data according to this distribution. Recent trends in scRNA-seq analysis have seen the emergence of deep learning approaches. Some techniques, such as the variational autoencoder, incorporate the ZINB distribution as a model loss function. Graph-based methods like Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) have also gained attention as deep learning methodologies for scRNA-seq analysis. This study introduces scVGAE, an innovative approach integrating GCN into a variational autoencoder framework while utilizing a ZINB loss function. This integration presents a promising avenue for effectively addressing dropout events in scRNA-seq data, thereby enhancing the accuracy and reliability of downstream analyses. scVGAE outperforms other methods in cell clustering, with the best performance in 11 out of 14 datasets. An ablation study shows that all components of scVGAE are necessary. scVGAE is implemented in Python and downloadable at https://github.com/inoue0426/scVGAE.
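
To make the ZINB loss mentioned above concrete, here is a minimal sketch of the negative log-likelihood for a single count. The parameter names mu (mean), theta (dispersion), and pi (zero-inflation probability) follow a common parameterization and are assumptions for illustration; this is not the scVGAE implementation, which is available at the repository linked in the abstract.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and dispersion theta (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a zero-inflated negative binomial.

    pi is the probability that a zero comes from the extra "dropout" component.
    """
    if x == 0:
        # A zero can come from the dropout component or from the NB itself.
        prob_zero = pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta))
        return -math.log(prob_zero)
    # A positive count can only come from the NB component.
    return -(math.log(1.0 - pi) + nb_log_pmf(x, mu, theta))


# Example: loss of an observed zero and an observed count of 3.
loss_zero = zinb_nll(0, mu=2.0, theta=1.5, pi=0.3)
loss_three = zinb_nll(3, mu=2.0, theta=1.5, pi=0.3)
```

In a deep learning setting, mu, theta, and pi would typically be produced per gene and per cell by the decoder, and the loss would be summed over all entries of the count matrix.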

Citations: 0

Disentangling shared and private latent factors in multimodal Variational Autoencoders

arXiv - QuanBio - Genomics

Pub Date : 2024-03-10 DOI: arxiv-2403.06338

Kaspar Märtens, Christopher Yau

Generative models for multimodal data permit the identification of latent factors that may be associated with important determinants of observed data heterogeneity. Common or shared factors could be important for explaining variation across modalities, whereas other factors may be private and important only for the explanation of a single modality. Multimodal Variational Autoencoders, such as MVAE and MMVAE, are a natural choice for inferring those underlying latent factors and separating shared variation from private. In this work, we investigate their capability to reliably perform this disentanglement. In particular, we highlight a challenging problem setting where modality-specific variation dominates the shared signal. Taking a cross-modal prediction perspective, we demonstrate limitations of existing models, and propose a modification that makes them more robust to modality-specific variation. Our findings are supported by experiments on synthetic as well as various real-world multi-omics data sets.

Citations: 0

The use of next-generation sequencing in personalized medicine

arXiv - QuanBio - Genomics

Pub Date : 2024-03-06 DOI: arxiv-2403.03688

Liya Popova, Valerie J. Carabetta

The revolutionary progress in the development of next-generation sequencing (NGS) technologies has made it possible to deliver accurate genomic information in a timely manner. Over the past several years, NGS has transformed biomedical and clinical research and found its application in the field of personalized medicine. Here we discuss the rise of personalized medicine and the history of NGS. We discuss current applications and uses of NGS in medicine, including infectious diseases, oncology, genomic medicine, and dermatology. We provide a brief discussion of selected studies where NGS was used to respond to a wide variety of questions in biomedical research and clinical medicine. Finally, we discuss the challenges of implementing NGS into routine clinical use.

Citations: 0

CRISPR: Ensemble Model

arXiv - QuanBio - Genomics

Pub Date : 2024-03-05 DOI: arxiv-2403.03018

Mohammad Rostami, Amin Ghariyazi, Hamed Dashti, Mohammad Hossein Rohban, Hamid R. Rabiee

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) is a gene editing technology that has revolutionized the fields of biology and medicine. However, one of the challenges of using CRISPR is predicting the on-target efficacy and off-target sensitivity of single-guide RNAs (sgRNAs). This is because most existing methods are trained on separate datasets with different genes and cells, which limits their generalizability. In this paper, we propose a novel ensemble learning method for sgRNA design that is accurate and generalizable. Our method combines the predictions of multiple machine learning models to produce a single, more robust prediction. This approach allows us to learn from a wider range of data, which improves the generalizability of our model. We evaluated our method on a benchmark dataset of sgRNA designs and found that it outperformed existing methods in terms of both accuracy and generalizability. Our results suggest that our method can be used to design sgRNAs with high sensitivity and specificity, even for new genes or cells. This could have important implications for the clinical use of CRISPR, as it would allow researchers to design more effective and safer treatments for a variety of diseases.
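
The idea of combining several predictors into one score can be sketched as a weighted average. The abstract does not say how the authors combine their models, so the averaging rule here is an assumption, and the two scoring functions are hypothetical toy stand-ins for trained models, not real sgRNA efficacy predictors.

```python
def gc_content_score(guide):
    """Toy stand-in for a trained model: fraction of G/C bases in the guide."""
    return sum(base in "GC" for base in guide) / len(guide)


def no_poly_t_score(guide):
    """Toy stand-in for a second model: penalize a run of four T bases."""
    return 0.0 if "TTTT" in guide else 1.0


def ensemble_predict(models, weights, guide):
    """Weighted average of the individual model predictions for one guide."""
    total_weight = sum(weights)
    weighted_sum = sum(w * model(guide) for model, w in zip(models, weights))
    return weighted_sum / total_weight


models = [gc_content_score, no_poly_t_score]
weights = [1.0, 1.0]  # equal weights; in practice these could be tuned on validation data
score = ensemble_predict(models, weights, "GACGTTAGCCGATAGCTAGC")
```

Averaging tends to reduce the variance of individual predictors, which is one common reason an ensemble can generalize better than any single member.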

Citations: 0

A genome-scale deep learning model to predict gene expression changes of genetic perturbations from multiplex biological networks

arXiv - QuanBio - Genomics

Pub Date : 2024-03-05 DOI: arxiv-2403.02724

Lingmin Zhan, Yuanyuan Zhang, Yingdong Wang, Aoyi Wang, Caiping Cheng, Jinzhong Zhao, Wuxia Zhang, Peng Lia, Jianxin Chen

Systematic characterization of biological effects to genetic perturbation is essential to the application of molecular biology and biomedicine. However, the experimental exhaustion of genetic perturbations on the genome-wide scale is challenging. Here, we present TranscriptionNet, a deep learning model that integrates multiple biological networks to systematically predict transcriptional profiles for three types of genetic perturbations, based on transcriptional profiles induced by genetic perturbations in the L1000 project: RNA interference (RNAi), clustered regularly interspaced short palindromic repeat (CRISPR) and overexpression (OE). TranscriptionNet performs better than existing approaches in predicting inducible gene expression changes for all three types of genetic perturbations. TranscriptionNet can predict transcriptional profiles for all genes in existing biological networks and extends the perturbational gene expression changes available for each type of genetic perturbation from a few thousand to 26,945 genes. TranscriptionNet demonstrates strong generalization ability when comparing predicted and true gene expression changes on different external tasks. Overall, TranscriptionNet can systematically predict transcriptional consequences induced by perturbing genes on a genome-wide scale and thus holds promise to systematically detect gene function and enhance drug development and target discovery.

Citations: 0

Machine and deep learning methods for predicting 3D genome organization

arXiv - QuanBio - Genomics

Pub Date : 2024-03-04 DOI: arxiv-2403.03231

Brydon P. G. Wall, My Nguyen, J. Chuck Harrell, Mikhail G. Dozmorov

Three-Dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNase-seq, etc.), DNA sequencing information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles of computational prediction of 3D interactions and suggest future research directions.
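
As an illustration of the k-mer features mentioned above, the following sketch counts overlapping k-mers in a DNA sequence. The function name kmer_counts and the example sequence are hypothetical; the tools covered by the review may build their sequence features differently.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count all overlapping substrings of length k in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Dinucleotide (k = 2) counts for a short example sequence.
counts = kmer_counts("ACGTAC", k=2)
```

Counts like these are often normalized by the number of k-mers in the region and concatenated with annotation signals before being passed to a classifier.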

Citations: 0

Graph-based variant discovery reveals novel dynamics in the human microbiome

arXiv - QuanBio - Genomics

Pub Date : 2024-03-03 DOI: arxiv-2403.01610

Harihara Subrahmaniam Muralidharan, Jacquelyn S Michaelis, Jay Ghurye, Todd Treangen, Sergey Koren, Marcus Fedarko, Mihai Pop

Sequence differences between the strains of bacteria comprising host-associated and environmental microbiota may play a role in community assembly and influence the resilience of microbial communities to disturbances. Tools for characterizing strain-level variation within microbial communities, however, are limited in scope, focusing on just single nucleotide polymorphisms, or relying on reference-based analyses that miss complex functional and structural variants. Here, we demonstrate the power of assembly graph analysis to detect and characterize structural variants in almost 1,000 metagenomes generated as part of the Human Microbiome Project. We identify over nine million variants comprising insertion/deletion events, repeat copy-number changes, and mobile elements such as plasmids. We highlight some of the potential functional roles of these genomic changes. Our analysis revealed striking differences in the rate of variation across body sites, highlighting niche-specific mechanisms of bacterial adaptation. The structural variants we detect also include potentially novel prophage integration events, highlighting the potential use of graph-based analyses for phage discovery.

Citations: 0

$Γ$-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional data

arXiv - QuanBio - Genomics

Pub Date : 2024-03-02 DOI: arxiv-2403.01078

Jason Z. Kim, Nicolas Perrin-Gilbert, Erkan Narmanli, Paul Klein, Christopher R. Myers, Itai Cohen, Joshua J. Waterfall, James P. Sethna

Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces. For example, despite the tens of thousands of genes in the human genome, the principled study of genomics is fruitful because biological processes rely on coordinated organization that results in lower dimensional phenotypes. To uncover this organization, many nonlinear dimensionality reduction techniques have successfully embedded high-dimensional data into low-dimensional spaces by preserving local similarities between data points. However, the nonlinearities in these methods allow for too much curvature to preserve general trends across multiple non-neighboring data clusters, thereby limiting their interpretability and generalizability to out-of-distribution data. Here, we address both of these limitations by regularizing the curvature of manifolds generated by variational autoencoders, a process we coin "$\Gamma$-VAE". We demonstrate its utility using two example data sets: bulk RNA-seq from The Cancer Genome Atlas (TCGA) and the Genotype Tissue Expression (GTEx); and single cell RNA-seq from a lineage tracing experiment in hematopoietic stem cell differentiation. We find that the resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of-distribution cancers as if they were originally trained on them. Finally, we show that preserving long-range relationships to differentiated cells separates undifferentiated cells, which have not yet specialized, according to their eventual fate. Broadly, we anticipate that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models in any high-dimensional system with emergent low-dimensional behavior.

Citations: 0

Exploring Gene Regulatory Interaction Networks and predicting therapeutic molecules for Hypopharyngeal Cancer and EGFR-mutated lung adenocarcinoma

arXiv - QuanBio - Genomics

Pub Date : 2024-02-27 DOI: arxiv-2402.17807

Abanti Bhattacharjya, Md Manowarul Islam, Md Ashraf Uddin, Md. Alamin Talukder, AKM Azad, Sunil Aryal, Bikash Kumar Paul, Wahia Tasnim, Muhammad Ali Abdulllah Almoyad, Mohammad Ali Moni

With the advent of information technology, the bioinformatics research field is becoming increasingly attractive to researchers and academicians. The recent development of various bioinformatics toolkits has facilitated the rapid processing and analysis of vast quantities of biological data for human perception. Most studies focus on locating two connected diseases and making some observations to construct diverse gene regulatory interaction networks, a forerunner to general drug design for curing illness. For instance, hypopharyngeal cancer is a disease that is associated with EGFR-mutated lung adenocarcinoma. In this study, we select EGFR-mutated lung adenocarcinoma and hypopharyngeal cancer by finding lung metastases in hypopharyngeal cancer. To conduct this study, we collect microarray datasets from GEO (Gene Expression Omnibus), an online database maintained by NCBI. Differentially expressed genes, common genes, and hub genes between the two selected diseases are detected for the subsequent steps. Our research findings suggest common therapeutic molecules for the selected diseases based on the 10 hub genes with the highest interactions according to the degree topology method and the maximum clique centrality (MCC). Our suggested therapeutic molecules will be fruitful for patients who have both diseases simultaneously.
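The abstract ranks hub genes by degree and by maximum clique centrality (MCC) but does not spell out either score. The sketch below is an illustration only, not the authors' pipeline. It assumes the commonly cited MCC definition, in which a node's score is the sum of (|C| - 1)! over every maximal clique C containing that node, and it uses placeholder gene labels (`G1` to `G8`) and a toy edge list that do not come from the paper.

```python
from math import factorial


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed nodes
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def hub_scores(edges):
    """Return (degree, mcc) dictionaries for every node in the edge list."""
    adj = build_adjacency(edges)
    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    mcc = {node: 0 for node in adj}
    for clique in maximal_cliques(adj):
        weight = factorial(len(clique) - 1)  # assumed MCC weight (|C| - 1)!
        for node in clique:
            mcc[node] += weight
    return degree, mcc


def top_k(scores, k):
    """Top-k nodes by score, ties broken alphabetically."""
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


# Toy interaction network with placeholder gene labels (hypothetical data).
toy_edges = [
    ("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
    ("G2", "G3"), ("G2", "G4"), ("G3", "G4"),  # G1..G4 form a 4-clique
    ("G4", "G5"), ("G5", "G6"), ("G5", "G7"), ("G5", "G8"),
]
degree, mcc = hub_scores(toy_edges)
print(top_k(degree, 2))  # ['G4', 'G5']
print(top_k(mcc, 2))     # ['G4', 'G1']
```

In this toy graph `G4` and `G5` tie on degree, but MCC favors `G4` and the other members of the dense 4-clique. This shows how the two rankings can disagree on which genes count as hubs. For real networks, an established clique enumerator would be a safer choice than this plain recursion.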


Citations: 0