arXiv - QuanBio - Genomics: Latest Papers, Page 4

scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM

arXiv - QuanBio - Genomics

Pub Date : 2024-07-24 DOI: arxiv-2407.16984

Shang-Jung Wen, Jia-Ming Chang, Fang Yu

High-dimensional single-cell data poses significant challenges in identifying underlying biological patterns due to the complexity and heterogeneity of cellular states. We propose a comprehensive gene-cell dependency visualization via unsupervised clustering, Growing Hierarchical Self-Organizing Map (GHSOM), specifically designed for analyzing high-dimensional single-cell data like single-cell sequencing and CRISPR screens. GHSOM is applied to cluster samples in a hierarchical structure such that the self-growth structure of clusters satisfies the required variation between and within clusters. We propose a novel Significant Attributes Identification Algorithm to identify features that distinguish clusters. This algorithm pinpoints attributes with minimal variation within a cluster but substantial variation between clusters. These key attributes can then be used for targeted data retrieval and downstream analysis. Furthermore, we present two innovative visualization tools: Cluster Feature Map and Cluster Distribution Map. The Cluster Feature Map highlights the distribution of specific features across the hierarchical structure of GHSOM clusters. This allows for rapid visual assessment of cluster uniqueness based on chosen features. The Cluster Distribution Map depicts leaf clusters as circles on the GHSOM grid, with circle size reflecting cluster data size and color customizable to visualize features like cell type or other attributes. We apply our analysis to three single-cell datasets and one CRISPR dataset (cell-gene database) and evaluate clustering methods with internal and external scores (CH and ARI, respectively). GHSOM performs well, being the best performer in internal evaluation (CH=4.2). In external evaluation, GHSOM has the third-best performance of all methods.
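
The idea of attributes with little variation inside a cluster but large variation between clusters can be illustrated with a simple variance ratio. The following is only a minimal sketch under that assumption, not the Significant Attributes Identification Algorithm from the paper; the function name `significant_attributes`, the scoring rule, and the toy data are all hypothetical.

```python
from statistics import mean, pvariance

def significant_attributes(data, labels, top_k=2):
    """Rank features by between-cluster variance over mean within-cluster variance."""
    clusters = {}
    for row, label in zip(data, labels):
        clusters.setdefault(label, []).append(row)
    n_features = len(data[0])
    scores = []
    for j in range(n_features):
        # Variance of the per-cluster means: how far apart clusters sit on feature j.
        cluster_means = [mean(r[j] for r in rows) for rows in clusters.values()]
        between = pvariance(cluster_means)
        # Average variance inside each cluster: how tight clusters are on feature j.
        within = mean(pvariance([r[j] for r in rows]) for rows in clusters.values())
        # Small constant avoids division by zero (an illustrative choice).
        scores.append((between / (within + 1e-9), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]

# Toy data: feature 0 separates the two clusters, features 1 and 2 do not.
data = [
    [1.0, 5.0, 0.1],
    [1.1, 2.0, 0.9],
    [9.0, 4.0, 0.2],
    [9.2, 3.0, 0.8],
]
labels = [0, 0, 1, 1]
ranked = significant_attributes(data, labels, top_k=1)
```

In this toy example only feature 0 has cluster means that differ, so it is ranked first; a real analysis would use the cluster assignments produced by GHSOM rather than hand-written labels.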

Citations: 0

GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

arXiv - QuanBio - Genomics

Pub Date : 2024-07-24 DOI: arxiv-2407.16940

Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang

Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next generation sequencing cost has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To address the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This poses the question: How effectively do deep learning methods classify unknown GVs and align them with clinically-verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an effective approach for addressing both indexing and classification challenges. We introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring variable-length contexts and detailed annotations, designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts. Our contributions are three-fold: (i) Construction of a comprehensive dataset with 7 million records, each labeled with characteristics of the corresponding variants, alongside additional data from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant combinations, and 156 unique clinically verified GVs from real-world patients. (ii) Analysis of the structure and properties of the dataset. (iii) Experimentation on the dataset with pre-trained GFMs. The results show a significant gap between GFMs' current capabilities and accurate GV representation. We hope this dataset will help advance genomic deep learning to bridge this gap.

Citations: 0

LSTM Autoencoder-based Deep Neural Networks for Barley Genotype-to-Phenotype Prediction

arXiv - QuanBio - Genomics

Pub Date : 2024-07-21 DOI: arxiv-2407.16709

Guanjin Wang, Junyu Xuan, Penghao Wang, Chengdao Li, Jie Lu

Artificial Intelligence (AI) has emerged as a key driver of precision agriculture, facilitating enhanced crop productivity, optimized resource use, farm sustainability, and informed decision-making. Also, the expansion of genome sequencing technology has greatly increased crop genomic resources, deepening our understanding of genetic variation and enhancing desirable crop traits to optimize performance in various environments. There is increasing interest in using machine learning (ML) and deep learning (DL) algorithms for genotype-to-phenotype prediction due to their excellence in capturing complex interactions within large, high-dimensional datasets. In this work, we propose a new LSTM autoencoder-based model for barley genotype-to-phenotype prediction, specifically for flowering time and grain yield estimation, which could potentially help optimize yields and management practices. Our model outperformed the other baseline methods, demonstrating its potential in handling complex high-dimensional agricultural datasets and enhancing crop phenotype prediction performance.

Citations: 0

A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language

arXiv - QuanBio - Genomics

Pub Date : 2024-07-21 DOI: arxiv-2407.15888

Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun

Predicting gene function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging information in public databases linking DNA sequences to an enzymatic function label. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels, and is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often captured alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning models that predict enzymatic function are likely to benefit from incorporating this multi-modal data encoding scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks that demonstrate the difficulty of this modeling objective, while demonstrating the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.

Citations: 0

SpaDiT: Diffusion Transformer for Spatial Gene Expression Prediction using scRNA-seq

arXiv - QuanBio - Genomics

Pub Date : 2024-07-18 DOI: arxiv-2407.13182

Xiaoyu Li, Fangfang Zhu, Wenwen Min

The rapid development of spatial transcriptomics (ST) technologies is revolutionizing our understanding of the spatial organization of biological tissues. Current ST methods, categorized into next-generation sequencing-based (seq-based) and fluorescence in situ hybridization-based (image-based) methods, offer innovative insights into the functional dynamics of biological tissues. However, these methods are limited by their cellular resolution and the quantity of genes they can detect. To address these limitations, we propose SpaDiT, a deep learning method that utilizes a diffusion generative model to integrate scRNA-seq and ST data for the prediction of undetected genes. By employing a Transformer-based diffusion model, SpaDiT not only accurately predicts unknown genes but also effectively generates the spatial structure of ST genes. We have demonstrated the effectiveness of SpaDiT through extensive experiments on both seq-based and image-based ST data. SpaDiT significantly contributes to ST gene prediction methods with its innovative approach. Compared to eight leading baseline methods, SpaDiT achieved state-of-the-art performance across multiple metrics, highlighting its substantial bioinformatics contribution.

Citations: 0

Single-cell 3D genome reconstruction in the haploid setting using rigidity theory

arXiv - QuanBio - Genomics

Pub Date : 2024-07-15 DOI: arxiv-2407.10700

Sean Dewar, Georg Grasegger, Kaie Kubjas, Fatemeh Mohammadi, Anthony Nixon

This article considers the problem of 3-dimensional genome reconstruction for single-cell data, and the uniqueness of such reconstructions in the setting of haploid organisms. We consider multiple graph models as representations of this problem, and use techniques from graph rigidity theory to determine identifiability. Biologically, our models come from Hi-C data, microscopy data, and combinations thereof. Mathematically, we use unit ball and sphere packing models, as well as models consisting of distance and inequality constraints. In each setting, we describe and/or derive new results on realisability and uniqueness. We then propose a 3D reconstruction method based on semidefinite programming and apply it to synthetic and real data sets using our models.
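
To make the link between distance constraints and semidefinite programming concrete, a standard textbook relaxation is sketched below. It is not taken from the paper and may differ from the authors' formulation. The symbols are assumptions for illustration: $n$ points $x_1, \dots, x_n \in \mathbb{R}^3$, a set $E$ of pairs with known distances $d_{ij}$, and the Gram matrix $G$ with $G_{ij} = x_i^{\top} x_j$, so that $\lVert x_i - x_j \rVert^2 = G_{ii} + G_{jj} - 2G_{ij}$.

```latex
\begin{aligned}
\text{find}\quad & G \in \mathbb{R}^{n \times n} \\
\text{subject to}\quad & G_{ii} + G_{jj} - 2G_{ij} = d_{ij}^{2}
    \quad \text{for all } \{i,j\} \in E, \\
& G \succeq 0 .
\end{aligned}
```

The non-convex requirement $\operatorname{rank}(G) \le 3$ is dropped in this relaxation; coordinates are then typically recovered from the leading three eigenvectors of a solution $G$. Inequality constraints on other pairs can be added in the same linear form.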

Citations: 0

OmniGenome: Aligning RNA Sequences with Secondary Structures in Genomic Foundation Models

arXiv - QuanBio - Genomics

Pub Date : 2024-07-15 DOI: arxiv-2407.11242

Heng Yang, Ke Li

The structures of RNA sequences play a vital role in various cellular processes, while existing genomic foundation models (FMs) have struggled with precise sequence-structure alignment, due to the complexity of exponential combinations of nucleotide bases. In this study, we introduce OmniGenome, a foundation model that addresses this critical challenge of sequence-structure alignment in RNA FMs. OmniGenome bridges the sequences with secondary structures using structure-contextualized modeling, enabling hard in-silico genomic tasks that existing FMs cannot handle, e.g., RNA design tasks. The results on two comprehensive genomic benchmarks show that OmniGenome achieves state-of-the-art performance on complex RNA subtasks. For example, OmniGenome solved 74% of complex puzzles, compared to SpliceBERT which solved only 3% of the puzzles. Besides, OmniGenome solves most of the puzzles within 1 hour, while the existing methods usually allocate 24 hours for each puzzle. Overall, OmniGenome establishes wide genomic application cases and offers profound insights into biological mechanisms from the perspective of sequence-structure alignment.

Citations: 0

CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis

arXiv - QuanBio - Genomics

Pub Date : 2024-07-13 DOI: arxiv-2407.09811

Yihang Xiao, Jinyi Liu, Yan Zheng, Xiaohan Xie, Jianye Hao, Mingzhi Li, Ruitao Wang, Fei Ni, Yuxiao Li, Jintian Luo, Shaoqing Jiao, Jiajie Peng

Single-cell RNA sequencing (scRNA-seq) data analysis is crucial for biological research, as it enables the precise characterization of cellular heterogeneity. However, manual manipulation of various tools to achieve desired outcomes can be labor-intensive for researchers. To address this, we introduce CellAgent (http://cell.agent4science.cn/), an LLM-driven multi-agent framework, specifically designed for the automatic processing and execution of scRNA-seq data analysis tasks, providing high-quality results with no human intervention. Firstly, to adapt general LLMs to the biological field, CellAgent constructs LLM-driven biological expert roles - planner, executor, and evaluator - each with specific responsibilities. Then, CellAgent introduces a hierarchical decision-making mechanism to coordinate these biological experts, effectively driving the planning and step-by-step execution of complex data analysis tasks. Furthermore, we propose a self-iterative optimization mechanism, enabling CellAgent to autonomously evaluate and optimize solutions, thereby guaranteeing output quality. We evaluate CellAgent on a comprehensive benchmark dataset encompassing dozens of tissues and hundreds of distinct cell types. Evaluation results consistently show that CellAgent effectively identifies the most suitable tools and hyperparameters for single-cell analysis tasks, achieving optimal performance. This automated framework dramatically reduces the workload for science data analyses, bringing us into the "Agent for Science" era.
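
The planner, executor, and evaluator roles with self-iterative optimization can be pictured as a plan-then-retry loop. The sketch below is only a generic illustration of that control flow, not CellAgent's implementation or API: `run_pipeline`, its parameters, the score threshold, and the stub role functions are all hypothetical, and plain Python functions stand in for LLM-backed agents.

```python
from typing import Callable, List, Tuple

def run_pipeline(
    task: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str, int], str],
    evaluate: Callable[[str], float],
    max_attempts: int = 3,
    threshold: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Plan a task into steps, then retry each step until its score passes."""
    results = []
    for step in plan(task):
        best_output, best_score = "", float("-inf")
        for attempt in range(max_attempts):
            output = execute(step, attempt)
            score = evaluate(output)
            if score > best_score:
                best_output, best_score = output, score
            if best_score >= threshold:
                break  # good enough; move on to the next step
        results.append((step, best_output, best_score))
    return results

# Stub roles standing in for LLM-backed agents (purely illustrative).
def toy_plan(task):
    return ["quality control", "clustering"]

def toy_execute(step, attempt):
    return f"{step} v{attempt}"

def toy_evaluate(output):
    # Pretend the first attempt is weak and later attempts are better.
    return 0.5 if output.endswith("v0") else 0.9

results = run_pipeline("analyze scRNA-seq counts", toy_plan, toy_execute, toy_evaluate)
```

Keeping the best-scoring output per step, rather than the last one, means a failed retry never makes the final result worse; this is a design choice of the sketch, not something stated in the abstract.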

Citations: 0

FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313

arXiv - QuanBio - Genomics

Pub Date : 2024-07-12 DOI: arxiv-2407.09355

Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida

Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, leading to challenges such as the need for substantial computational resources to train and inefficiency for client-sided deployment. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-sided imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. Utilizing consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights by allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores when calculated using SNPs imputed from consumer gene panels, such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and light-weight alternative to reference based imputation.
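The contrast between regression-based and frequency-based imputation can be sketched on synthetic data. The example below is an illustration only, not the FastImpute pipeline: the genotypes are simulated, the number of SNPs and the 10% noise rate are arbitrary assumptions, and the baseline fills in the training-set mean dosage as a stand-in for substituting the allele frequency.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic genotypes (assumption): 5 typed SNPs coded as 0/1/2 allele dosages.
n_samples, n_typed = 2000, 5
typed = rng.binomial(2, 0.3, size=(n_samples, n_typed)).astype(float)

# The untyped SNP copies typed SNP 0 except for 10% of samples,
# which mimics strong but imperfect linkage disequilibrium.
noisy = rng.random(n_samples) < 0.1
untyped = np.where(noisy, rng.binomial(2, 0.3, size=n_samples), typed[:, 0]).astype(float)

train_rows, test_rows = slice(0, 1500), slice(1500, None)


def fit_linear_imputer(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares with an intercept column."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Predicted dosage of the untyped SNP from the typed SNPs."""
    return np.column_stack([np.ones(len(x)), x]) @ coef


def mse(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean squared error between true and imputed dosages."""
    return float(np.mean((y - y_hat) ** 2))


coef = fit_linear_imputer(typed[train_rows], untyped[train_rows])
pred_lr = predict(coef, typed[test_rows])

# Baseline: every missing genotype gets the same frequency-derived value.
pred_mean = np.full(pred_lr.shape, untyped[train_rows].mean())

print("linear regression MSE:", mse(untyped[test_rows], pred_lr))
print("mean-dosage baseline MSE:", mse(untyped[test_rows], pred_mean))
```

A constant fill cannot track per-individual variation, whereas the regression uses the correlated typed SNPs, which is why it recovers more of the signal in this toy setting. The R^2 values quoted in the abstract come from the authors' real data and are not reproduced here.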



Citations: 0

High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

arXiv - QuanBio - Genomics

Pub Date : 2024-07-10 DOI: arxiv-2407.07718

Yifan Li, Giulia Guidi

In generating large quantities of DNA data, high-throughput sequencing technologies require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of fixed-length k DNA subsequences, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, the scaling of the counting process is critical. In the literature, distributed memory software uses hash tables, which exhibit poor cache friendliness and consume excessive memory. They often also lack support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8x speedup, proving its flexibility and practicality in real-world scenarios.
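The core idea of sorting-based counting, as opposed to hash-table counting, can be shown in a few lines. This is a single-process toy, assuming plain string k-mers and no reverse-complement handling; it is not HySortK's distributed implementation, and the function name `count_kmers_by_sorting` is made up for illustration.

```python
from itertools import groupby


def count_kmers_by_sorting(sequence: str, k: int) -> dict[str, int]:
    """Count k-mers by sorting them and measuring runs of equal neighbours."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Enumerate every length-k window of the sequence.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Sorting places identical k-mers next to each other,
    kmers.sort()
    # so each run of equal values is one distinct k-mer and its count.
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(kmers)}


counts = count_kmers_by_sorting("ACGTACGTAC", 3)
print(counts)  # {'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}
```

After sorting, counting is a single linear scan over contiguous memory, which is the access pattern that avoids the random lookups of a hash table. In a distributed setting the k-mers would first be partitioned across nodes so that equal k-mers land on the same node; that step is omitted here.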



Citations: 0