arXiv - QuanBio - Genomics: Latest Papers


Insights, opportunities and challenges provided by large cell atlases

arXiv - QuanBio - Genomics

Pub Date : 2024-08-13 DOI: arxiv-2408.06563

Martin Hemberg, Federico Marini, Shila Ghazanfar, Ahmad Al Ajami, Najla Abassi, Benedict Anchang, Bérénice A. Benayoun, Yue Cao, Ken Chen, Yesid Cuesta-Astroz, Zach DeBruine, Calliope A. Dendrou, Iwijn De Vlaminck, Katharina Imkeller, Ilya Korsunsky, Alex R. Lederer, Pieter Meysman, Clint Miller, Kerry Mullan, Uwe Ohler, Nikolaos Patikas, Jonas Schuck, Jacqueline HY Siu, Timothy J. Triche Jr., Alex Tsankov, Sander W. van der Laan, Masanao Yajima, Jean Yang, Fabio Zanini, Ivana Jelic

The field of single-cell biology is growing rapidly and is generating large amounts of data from a variety of species, disease conditions, tissues, and organs. Coordinated efforts such as CZI CELLxGENE, HuBMAP, Broad Institute Single Cell Portal, and DISCO allow researchers to access large volumes of curated datasets. Although the majority of the data is from scRNAseq experiments, a wide range of other modalities are represented as well. These resources have created an opportunity to build and expand the computational biology ecosystem to develop tools necessary for data reuse, and for extracting novel biological insights. Here, we highlight achievements made so far, areas where further development is needed, and specific challenges that need to be overcome.


Citations: 0

Pretrained-Guided Conditional Diffusion Models for Microbiome Data Analysis

arXiv - QuanBio - Genomics

Pub Date : 2024-08-10 DOI: arxiv-2408.07709

Xinyuan Shi, Fangfang Zhu, Wenwen Min

Emerging evidence indicates that human cancers are intricately linked to human microbiomes, forming an inseparable connection. However, due to limited sample sizes and significant data loss during collection for various reasons, some machine learning methods have been proposed to address the issue of missing data. These methods have not fully utilized the known clinical information of patients to enhance the accuracy of data imputation. Therefore, we introduce mbVDiT, a novel pre-trained conditional diffusion model for microbiome data imputation and denoising, which uses the unmasked data and patient metadata as conditional guidance for imputing missing values. It also uses a VAE to integrate other public microbiome datasets to enhance model performance. The results on the microbiome datasets from three different cancer types demonstrate the performance of our methods in comparison with existing methods.
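
The masking setup described in this abstract can be illustrated with a minimal sketch: some entries of a toy abundance matrix are hidden, filled in using only the unmasked entries, and scored only on the hidden positions. The column-mean imputer is a deliberately simple stand-in chosen for illustration, not the mbVDiT diffusion model, and the matrix values and function names are hypothetical.

```python
import math

def column_mean_impute(matrix, mask):
    """Fill masked entries with the mean of the unmasked entries in the same column.

    NOTE: a simple stand-in imputer, not the diffusion model from the paper.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    imputed = [row[:] for row in matrix]
    for j in range(n_cols):
        # Only unmasked (observed) values are ever read here.
        observed = [matrix[i][j] for i in range(n_rows) if not mask[i][j]]
        fill = sum(observed) / len(observed) if observed else 0.0
        for i in range(n_rows):
            if mask[i][j]:
                imputed[i][j] = fill
    return imputed

def masked_rmse(truth, imputed, mask):
    """Root-mean-square error computed over the masked positions only."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2
               for i in range(len(truth))
               for j in range(len(truth[0]))
               if mask[i][j]]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical 3 samples x 2 taxa matrix; True marks a hidden entry.
truth = [[1.0, 4.0], [3.0, 6.0], [5.0, 8.0]]
mask = [[False, False], [True, False], [False, True]]

imputed = column_mean_impute(truth, mask)
rmse = masked_rmse(truth, imputed, mask)
print(imputed, rmse)
```

Scoring only the masked positions keeps the evaluation from rewarding a method for copying values it was allowed to see.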


Citations: 0

scASDC: Attention Enhanced Structural Deep Clustering for Single-cell RNA-seq Data

arXiv - QuanBio - Genomics

Pub Date : 2024-08-09 DOI: arxiv-2408.05258

Wenwen Min, Zhen Wang, Fangfang Zhu, Taosheng Xu, Shunfang Wang

Single-cell RNA sequencing (scRNA-seq) data analysis is pivotal for understanding cellular heterogeneity. However, the high sparsity and complex noise patterns inherent in scRNA-seq data present significant challenges for traditional clustering methods. To address these issues, we propose a deep clustering method, Attention-Enhanced Structural Deep Embedding Graph Clustering (scASDC), which integrates multiple advanced modules to improve clustering accuracy and robustness. Our approach employs a multi-layer graph convolutional network (GCN) to capture high-order structural relationships between cells, termed the graph autoencoder module. To mitigate the oversmoothing issue in GCNs, we introduce a ZINB-based autoencoder module that extracts content information from the data and learns latent representations of gene expression. These modules are further integrated through an attention fusion mechanism, ensuring effective combination of gene expression and structural information at each layer of the GCN. Additionally, a self-supervised learning module is incorporated to enhance the robustness of the learned embeddings. Extensive experiments demonstrate that scASDC outperforms existing state-of-the-art methods, providing a robust and effective solution for single-cell clustering tasks. Our method paves the way for more accurate and meaningful analysis of single-cell RNA sequencing data, contributing to better understanding of cellular heterogeneity and biological processes. All code and public datasets used in this paper are available at https://github.com/wenwenmin/scASDC and https://zenodo.org/records/12814320.
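
The ZINB (zero-inflated negative binomial) distribution mentioned in this abstract is commonly used as a reconstruction likelihood for sparse count data. The sketch below writes out the standard ZINB probability mass function and its negative log-likelihood for a single gene. The parameter names (`mu`, `theta`, `pi`) and the example counts are assumptions for illustration, and this is not the scASDC implementation.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu > 0
    and dispersion theta > 0."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_pmf(x, mu, theta, pi):
    """ZINB probability: with probability pi the count is a structural zero,
    otherwise it is drawn from the negative binomial."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb

def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts sharing the same parameters."""
    return -sum(math.log(zinb_pmf(x, mu, theta, pi)) for x in counts)

# Hypothetical sparse expression counts for one gene across six cells.
counts = [0, 0, 3, 0, 1, 7]
print(zinb_nll(counts, mu=2.0, theta=1.5, pi=0.3))
```

The extra mass `pi` at zero is what lets the model account for dropout events separately from genuinely low expression.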


Citations: 0

Masked Graph Autoencoders with Contrastive Augmentation for Spatially Resolved Transcriptomics Data

arXiv - QuanBio - Genomics

Pub Date : 2024-08-09 DOI: arxiv-2408.06377

Donghai Fang, Fangfang Zhu, Dongting Xie, Wenwen Min

With the rapid advancement of Spatially Resolved Transcriptomics (SRT) technology, it is now possible to comprehensively measure gene transcription while preserving the spatial context of tissues. Spatial domain identification and gene denoising are key objectives in SRT data analysis. We propose a Contrastively Augmented Masked Graph Autoencoder (STMGAC) to learn low-dimensional latent representations for domain identification. In the latent space, persistent signals for representations are obtained through self-distillation to guide self-supervised matching. At the same time, positive and negative anchor pairs are constructed using triplet learning to augment the discriminative ability. We evaluated the performance of STMGAC on five datasets, achieving results superior to those of existing baseline methods. All code and public datasets used in this paper are available at https://github.com/wenwenmin/STMGAC and https://zenodo.org/records/13253801.
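
The triplet learning mentioned in this abstract is usually expressed as a margin loss over an anchor, a positive, and a negative embedding. The sketch below shows the standard triplet margin loss with Euclidean distance. The distance metric, the margin value, and the toy embeddings are assumptions, and the abstract does not say how STMGAC selects its anchor pairs.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-D latent embeddings of spots.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1 from the anchor
easy_negative = [3.0, 4.0]   # distance 5 from the anchor
hard_negative = [1.0, 0.0]   # distance 1 from the anchor

print(triplet_loss(anchor, positive, easy_negative))  # 0.0
print(triplet_loss(anchor, positive, hard_negative))  # 1.0
```

Only triplets that violate the margin contribute a nonzero loss, so training pressure concentrates on negatives that sit too close to the anchor.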


Citations: 0

Heterogeneous graph attention network improves cancer multiomics integration

arXiv - QuanBio - Genomics

Pub Date : 2024-08-05 DOI: arxiv-2408.02845

Sina Tabakhi, Charlotte Vandermeulen, Ian Sudbery, Haiping Lu

The increase in high-dimensional multiomics data demands advanced integration models to capture the complexity of human diseases. Graph-based deep learning integration models, despite their promise, struggle with small patient cohorts and high-dimensional features, often applying independent feature selection without modeling relationships among omics. Furthermore, conventional graph-based omics models focus on homogeneous graphs, lacking multiple types of nodes and edges to capture diverse structures. We introduce a Heterogeneous Graph ATtention network for omics integration (HeteroGATomics) to improve cancer diagnosis. HeteroGATomics performs joint feature selection through a multi-agent system, creating dedicated networks of feature and patient similarity for each omic modality. These networks are then combined into one heterogeneous graph for learning holistic omic-specific representations and integrating predictions across modalities. Experiments on three cancer multiomics datasets demonstrate HeteroGATomics' superior performance in cancer diagnosis. Moreover, HeteroGATomics enhances interpretability by identifying important biomarkers contributing to the diagnosis outcomes.
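
One common way to build a patient similarity network for a single omic modality is a k-nearest-neighbour graph over pairwise similarities. The sketch below does this with cosine similarity. Both the similarity measure and the value of `k` are assumptions, since the abstract does not specify how HeteroGATomics constructs its networks, and the patient profiles are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_similarity_edges(profiles, k):
    """Return sorted undirected edges (i, j) with i < j, linking each patient
    to its k most similar other patients."""
    edges = set()
    for i, p in enumerate(profiles):
        scored = sorted(
            ((cosine(p, q), j) for j, q in enumerate(profiles) if j != i),
            reverse=True,
        )
        for _, j in scored[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

# Hypothetical profiles: 4 patients x 2 features from one omic modality.
profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_similarity_edges(profiles, k=1)
print(edges)  # [(0, 1), (2, 3)]
```

Repeating the construction once per modality yields one graph per omic layer, which is the kind of input a heterogeneous graph model could then combine.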


Citations: 0

Refinement of genetic variants needs attention

arXiv - QuanBio - Genomics

Pub Date : 2024-08-01 DOI: arxiv-2408.00659

Omar Abdelwahab, Davoud Torkamaneh

Variant calling refinement is crucial for distinguishing true genetic variants from technical artifacts in high-throughput sequencing data. Manual review is time-consuming while heuristic filtering often lacks optimal solutions. Traditional variant calling methods often struggle with accuracy, especially in regions of low read coverage, leading to false-positive or false-negative calls. Here, we introduce VariantTransformer, a Transformer-based deep learning model, designed to automate variant calling refinement directly from VCF files in low-coverage data (10-15X). VariantTransformer, trained on two million variants, including SNPs and short InDels, from low-coverage sequencing data, achieved an accuracy of 89.26% and a ROC AUC of 0.88. When integrated into conventional variant calling pipelines, VariantTransformer outperformed traditional heuristic filters and approached the performance of state-of-the-art AI-based variant callers like DeepVariant. Comparative analysis demonstrated VariantTransformer's superiority in functionality, variant type coverage, training size, and input data type. VariantTransformer represents a significant advancement in variant calling refinement for low-coverage genomic studies.
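
The two metrics reported in this abstract, accuracy and ROC AUC, measure different things: accuracy depends on a decision threshold, while ROC AUC depends only on how the scores rank true variants against artifacts. The sketch below computes both from scratch on made-up labels and scores. The data and the 0.5 threshold are hypothetical and unrelated to the paper's results.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)

def roc_auc(labels, scores):
    """Probability that a randomly chosen positive scores higher than a
    randomly chosen negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical refinement output: 1 = true variant, 0 = artifact.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(labels, predictions))  # 0.6
print(roc_auc(labels, scores))
```

The pairwise form of ROC AUC used here is quadratic in the number of calls, which is fine for a sketch but would be replaced by a rank-based computation at the scale of millions of variants.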


Citations: 0

Integrating spatially-resolved transcriptomics data across tissues and individuals: challenges and opportunities

arXiv - QuanBio - Genomics

Pub Date : 2024-08-01 DOI: arxiv-2408.00367

Boyi Guo, Wodan Ling, Sang Ho Kwon, Pratibha Panwar, Shila Ghazanfar, Keri Martinowich, Stephanie C. Hicks

Advances in spatially-resolved transcriptomics (SRT) technologies have propelled the development of new computational analysis methods to unlock biological insights. As the cost of generating these data decreases, these technologies provide an exciting opportunity to create large-scale atlases that integrate SRT data across multiple tissues, individuals, species, or phenotypes to perform population-level analyses. Here, we describe unique challenges of varying spatial resolutions in SRT data, as well as highlight the opportunities for standardized preprocessing methods along with computational algorithms amenable to atlas-scale datasets leading to improved sensitivity and reproducibility in the future.


Citations: 0

UnPaSt: unsupervised patient stratification by differentially expressed biclusters in omics data

arXiv - QuanBio - Genomics

Pub Date : 2024-07-31 DOI: arxiv-2408.00200

Michael Hartung, Andreas Maier, Fernando Delgado-Chaves, Yuliya Burankova, Olga I. Isaeva, Fábio Malta de Sá Patroni, Daniel He, Casey Shannon, Katharina Kaufmann, Jens Lohmann, Alexey Savchik, Anne Hartebrodt, Zoe Chervontseva, Farzaneh Firoozbakht, Niklas Probul, Evgenia Zotova, Olga Tsoy, David B. Blumenthal, Martin Ester, Tanja Laske, Jan Baumbach, Olga Zolotareva

Most complex diseases, including cancer and non-malignant diseases like asthma, have distinct molecular subtypes that require distinct clinical approaches. However, existing computational patient stratification methods have been benchmarked almost exclusively on cancer omics data and only perform well when mutually exclusive subtypes can be characterized by many biomarkers. Here, we contribute a large-scale evaluation, quantitatively exploring the power of 22 unsupervised patient stratification methods using both simulated and real transcriptome data. From this experience, we developed UnPaSt (https://apps.cosy.bio/unpast/), optimizing unsupervised patient stratification and working even with only a limited number of subtype-predictive biomarkers. We evaluated all 23 methods on real-world breast cancer and asthma transcriptomics data. Although many methods reliably detected major breast cancer subtypes, only a few identified Th2-high asthma, and UnPaSt significantly outperformed its closest competitors in both test datasets. Essentially, we showed that UnPaSt can detect many biologically insightful and reproducible patterns in omic datasets.


Citations: 0

Are gene-by-environment interactions leveraged in multi-modality neural networks for breast cancer prediction?

arXiv - QuanBio - Genomics

Pub Date : 2024-07-30 DOI: arxiv-2407.20978

Monica Isgut, Andrew Hornback, Yunan Luo, Asma Khimani, Neha Jain, May D. Wang

Polygenic risk scores (PRSs) can significantly enhance breast cancer risk prediction when combined with clinical risk factor data. While many studies have explored the value-add of PRSs, little is known about the potential impact of gene-by-gene or gene-by-environment interactions towards enhancing the risk discrimination capabilities of multi-modal models combining PRSs with clinical data. In this study, we integrated data on 318 individual genotype variants along with clinical data in a neural network to explore whether gene-by-gene (i.e., between individual variants) and/or gene-by-environment (between clinical risk factors and variants) interactions could be leveraged jointly during training to improve breast cancer risk prediction performance. We benchmarked our approach against a baseline model combining traditional univariate PRSs with clinical data in a logistic regression model and ran an interpretability analysis to identify feature interactions. While our model did not demonstrate improved performance over the baseline, we discovered 248 (<1%) statistically significant gene-by-gene and gene-by-environment interactions out of the ~53.6k possible feature pairs, the most contributory of which included rs6001930 (MKL1) and rs889312 (MAP3K1), with age and menopause being the most heavily interacting non-genetic risk factors. We also modeled the significant interactions as a network of highly connected features, suggesting that potential higher-order interactions are captured by the model. Although gene-by-environment (or gene-by-gene) interactions did not enhance breast cancer risk prediction performance in neural networks, our study provides evidence that these interactions can be leveraged by these models to inform their predictions. This study represents the first application of neural networks to screen for interactions impacting breast cancer risk using real-world data.
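
The pair counts in this abstract can be sanity-checked with a short calculation. The sketch below assumes that the ~53.6k figure counts unordered pairs among all input features. Under that assumption, 318 variants plus 10 clinical features would give 53,628 pairs. The value 10 is inferred here to match the reported total and is not stated in the abstract.

```python
import math

n_variants = 318   # stated in the abstract
n_clinical = 10    # ASSUMPTION: inferred so the pair count matches ~53.6k
n_features = n_variants + n_clinical

# Unordered feature pairs: n choose 2.
possible_pairs = math.comb(n_features, 2)

significant = 248  # stated in the abstract
fraction = significant / possible_pairs

print(possible_pairs)      # 53628
print(f"{fraction:.2%}")   # 0.46%
```

Under this reading, the significant interactions make up roughly half a percent of all candidate pairs, consistent with the "<1%" in the abstract.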


Citations: 0

PyamilySeq: A Python Tool for Interpretable Gene (Re)Clustering and Pangenomic Inference Across Species and Genera

arXiv - QuanBio - Genomics

Pub Date : 2024-07-27 DOI: arxiv-2407.19328

Nicholas J. Dimonaco

PyamilySeq is a Python-based tool designed for interpretable gene clustering and pangenomic inference, supporting analyses at both species and genus levels. It facilitates the clustering of gene sequences into families based on sequence similarity using CD-HIT, and can take the output of tried-and-tested sequence clustering tools such as CD-HIT, BLAST, DIAMOND, and MMseqs2. PyamilySeq is distinctive in its ability to integrate new sequences into existing clusters, providing a robust framework for iterative analysis while preserving the original clusters, which is useful when reannotating genomes. In addition to the standard Species mode, which, as with other tools, performs core-gene analysis across a species range, PyamilySeq can be run in Genus mode, where it detects the presence of gene families shared across multiple genera. These features enhance the tool's applicability for ongoing and past genomic studies and comparative analyses. PyamilySeq generates comprehensive outputs, including gene presence-absence matrices and aligned sequence data, enabling downstream analysis and interpretation of the identified gene groups and pangenomic data.
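
The gene presence-absence matrix and core-gene analysis mentioned in this abstract can be illustrated with a small sketch that starts from already-clustered gene families. The data structures, identifiers, and the `min_fraction` threshold below are hypothetical and do not reflect PyamilySeq's actual interface or file formats.

```python
def presence_absence(clusters, genomes):
    """Map each gene family to a 0/1 row over `genomes`.

    clusters: dict of family_id -> list of (genome, gene_id) members.
    """
    matrix = {}
    for family, members in clusters.items():
        present = {genome for genome, _ in members}
        matrix[family] = [1 if g in present else 0 for g in genomes]
    return matrix

def core_families(matrix, min_fraction=1.0):
    """Families present in at least `min_fraction` of genomes, sorted by name."""
    return sorted(f for f, row in matrix.items()
                  if sum(row) / len(row) >= min_fraction)

# Hypothetical clustering output for three genomes.
genomes = ["genome1", "genome2", "genome3"]
clusters = {
    "fam_A": [("genome1", "g1_0001"), ("genome2", "g2_0007"), ("genome3", "g3_0003")],
    "fam_B": [("genome1", "g1_0002"), ("genome3", "g3_0010")],
    "fam_C": [("genome2", "g2_0042")],
}

matrix = presence_absence(clusters, genomes)
print(matrix["fam_B"])                          # [1, 0, 1]
print(core_families(matrix))                    # ['fam_A']
print(core_families(matrix, min_fraction=0.6))  # ['fam_A', 'fam_B']
```

Relaxing `min_fraction` below 1.0 gives a soft-core set, which tolerates families that are missing from a few genomes because of assembly or annotation gaps.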



Citations: 0
