High-dimensional single-cell data poses significant challenges in identifying underlying biological patterns due to the complexity and heterogeneity of cellular states. We propose a comprehensive approach to gene-cell dependency visualization based on an unsupervised clustering method, the Growing Hierarchical Self-Organizing Map (GHSOM), specifically designed for analyzing high-dimensional single-cell data such as single-cell sequencing and CRISPR screens. GHSOM clusters samples into a hierarchical structure that grows on its own until the clusters satisfy the required variation between and within clusters. We propose a novel Significant Attributes Identification Algorithm to identify features that distinguish clusters. This algorithm pinpoints attributes with minimal variation within a cluster but substantial variation between clusters. These key attributes can then be used for targeted data retrieval and downstream analysis. Furthermore, we present two visualization tools: the Cluster Feature Map and the Cluster Distribution Map. The Cluster Feature Map highlights the distribution of specific features across the hierarchical structure of GHSOM clusters, allowing rapid visual assessment of cluster uniqueness based on chosen features. The Cluster Distribution Map depicts leaf clusters as circles on the GHSOM grid, with circle size reflecting the number of data points in the cluster and a customizable color that visualizes features such as cell type or other attributes. We apply our analysis to three single-cell datasets and one CRISPR dataset (cell-gene database) and evaluate clustering methods with an internal score, the Calinski-Harabasz (CH) index, and an external score, the Adjusted Rand Index (ARI). GHSOM performs well, being the best performer in internal evaluation (CH=4.2). In external evaluation, GHSOM has the third-best performance of all methods.
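The selection criterion described above (little variation within a cluster, large variation between clusters) can be illustrated with a per-feature variance ratio. This is only a minimal sketch under stated assumptions: the function name `feature_separation_scores`, the ratio itself, and the toy data are illustrative choices, not the paper's Significant Attributes Identification Algorithm.

```python
from statistics import fmean


def feature_separation_scores(rows, labels):
    """Score each feature by between-cluster / within-cluster variation.

    rows   : list of equal-length numeric feature vectors, one per sample
    labels : cluster label for each row
    A high score means the feature is stable inside clusters but differs
    across clusters. (Hypothetical helper, not the paper's algorithm.)
    """
    n_features = len(rows[0])

    # Group the samples by their cluster label.
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    scores = []
    for j in range(n_features):
        overall_mean = fmean(row[j] for row in rows)
        between = 0.0
        within = 0.0
        for members in clusters.values():
            values = [m[j] for m in members]
            cluster_mean = fmean(values)
            # Size-weighted squared distance of the cluster mean from the overall mean.
            between += len(values) * (cluster_mean - overall_mean) ** 2
            # Squared deviations of the samples from their own cluster mean.
            within += sum((v - cluster_mean) ** 2 for v in values)
        # The small epsilon avoids division by zero for features that are
        # constant inside every cluster.
        scores.append(between / (within + 1e-12))
    return scores


# Toy data: feature 0 separates the two clusters, feature 1 does not.
rows = [
    [1.0, 5.0], [1.1, 9.0],   # cluster "a"
    [4.0, 5.5], [4.1, 8.5],   # cluster "b"
]
labels = ["a", "a", "b", "b"]
scores = feature_separation_scores(rows, labels)
top_feature = max(range(len(scores)), key=scores.__getitem__)
```

Ranking features by such a score is one simple way to shortlist attributes for the targeted retrieval the abstract mentions; a real analysis would also need to handle scaling and multiple testing.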
scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM. Shang-Jung Wen, Jia-Ming Chang, Fang Yu. arXiv - QuanBio - Genomics, 2024-07-24. https://doi.org/arxiv-2407.16984
Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang
Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next-generation sequencing cost has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians, who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To address the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This raises the question: how effectively do deep learning methods classify unknown GVs and align them with clinically verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an effective approach for addressing both indexing and classification challenges. We introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring variable-length contexts and detailed annotations, designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts. Our contributions are three-fold: (i) Construction of a comprehensive dataset with 7 million records, each labeled with characteristics of the corresponding variants, alongside additional data from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant combinations, and 156 unique clinically verified GVs from real-world patients. (ii) Analysis of the structure and properties of the dataset. (iii) Experimentation on the dataset with pre-trained GFMs. The results show a significant gap between GFMs' current capabilities and accurate GV representation. We hope this dataset will help advance genomic deep learning to bridge this gap.
GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning. Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang. arXiv - QuanBio - Genomics, 2024-07-24. https://doi.org/arxiv-2407.16940
Guanjin Wang, Junyu Xuan, Penghao Wang, Chengdao Li, Jie Lu
Artificial Intelligence (AI) has emerged as a key driver of precision agriculture, facilitating enhanced crop productivity, optimized resource use, farm sustainability, and informed decision-making. Also, the expansion of genome sequencing technology has greatly increased crop genomic resources, deepening our understanding of genetic variation and enhancing desirable crop traits to optimize performance in various environments. There is increasing interest in using machine learning (ML) and deep learning (DL) algorithms for genotype-to-phenotype prediction due to their excellence in capturing complex interactions within large, high-dimensional datasets. In this work, we propose a new LSTM autoencoder-based model for barley genotype-to-phenotype prediction, specifically for flowering time and grain yield estimation, which could potentially help optimize yields and management practices. Our model outperformed the other baseline methods, demonstrating its potential in handling complex high-dimensional agricultural datasets and enhancing crop phenotype prediction performance.
LSTM Autoencoder-based Deep Neural Networks for Barley Genotype-to-Phenotype Prediction. Guanjin Wang, Junyu Xuan, Penghao Wang, Chengdao Li, Jie Lu. arXiv - QuanBio - Genomics, 2024-07-21. https://doi.org/arxiv-2407.16709
Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun
Predicting gene function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging information in public databases linking DNA sequences to an enzymatic function label. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels, and is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often captured alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning models that predict enzymatic function are likely to benefit from incorporating this multi-modal data encoding scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks that demonstrate the difficulty of this modeling objective, while demonstrating the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language. Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun. arXiv - QuanBio - Genomics, 2024-07-21. https://doi.org/arxiv-2407.15888
The rapid development of spatial transcriptomics (ST) technologies is revolutionizing our understanding of the spatial organization of biological tissues. Current ST methods, categorized into next-generation sequencing-based (seq-based) and fluorescence in situ hybridization-based (image-based) methods, offer innovative insights into the functional dynamics of biological tissues. However, these methods are limited by their cellular resolution and the quantity of genes they can detect. To address these limitations, we propose SpaDiT, a deep learning method that utilizes a diffusion generative model to integrate scRNA-seq and ST data for the prediction of undetected genes. By employing a Transformer-based diffusion model, SpaDiT not only accurately predicts unknown genes but also effectively generates the spatial structure of ST genes. We have demonstrated the effectiveness of SpaDiT through extensive experiments on both seq-based and image-based ST data. SpaDiT significantly contributes to ST gene prediction methods with its innovative approach. Compared to eight leading baseline methods, SpaDiT achieved state-of-the-art performance across multiple metrics, highlighting its substantial bioinformatics contribution.
SpaDiT: Diffusion Transformer for Spatial Gene Expression Prediction using scRNA-seq. Xiaoyu Li, Fangfang Zhu, Wenwen Min. arXiv - QuanBio - Genomics, 2024-07-18. https://doi.org/arxiv-2407.13182
Sean Dewar, Georg Grasegger, Kaie Kubjas, Fatemeh Mohammadi, Anthony Nixon
This article considers the problem of 3-dimensional genome reconstruction for single-cell data, and the uniqueness of such reconstructions in the setting of haploid organisms. We consider multiple graph models as representations of this problem, and use techniques from graph rigidity theory to determine identifiability. Biologically, our models come from Hi-C data, microscopy data, and combinations thereof. Mathematically, we use unit ball and sphere packing models, as well as models consisting of distance and inequality constraints. In each setting, we describe and/or derive new results on realisability and uniqueness. We then propose a 3D reconstruction method based on semidefinite programming and apply it to synthetic and real data sets using our models.
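The idea of a model made of distance and inequality constraints can be made concrete with a small feasibility check on a candidate structure. The sketch below is an illustration only: the constraint types, the thresholds `contact_radius` and `min_separation`, and the function name are assumptions, and it checks a given structure rather than performing the semidefinite-programming reconstruction the abstract proposes.

```python
from itertools import combinations
from math import dist


def violated_constraints(coords, contacts, contact_radius=1.5, min_separation=1.0):
    """List the constraints that a candidate 3D structure violates.

    coords   : list of (x, y, z) positions, one per genomic locus
    contacts : set of index pairs (i, j), i < j, observed to be in contact
    Two assumed constraint types are checked:
      * every pair of loci must be at least `min_separation` apart
        (a sphere-packing style inequality), and
      * every contact pair must be within `contact_radius`.
    """
    violations = []
    for i, j in combinations(range(len(coords)), 2):
        d = dist(coords[i], coords[j])
        if d < min_separation:
            violations.append(("overlap", i, j))
        if (i, j) in contacts and d > contact_radius:
            violations.append(("contact_too_far", i, j))
    return violations


contacts = {(0, 1), (1, 2), (2, 3), (1, 3)}

# A bent chain: locus 3 folds back towards locus 1, so all contacts hold.
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
ok = violated_constraints(bent, contacts)

# A straight chain: loci 1 and 3 are too far apart for their contact.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
bad = violated_constraints(straight, contacts)
```

A checker like this says nothing about uniqueness, which is the question rigidity theory addresses; it only verifies that a proposed embedding is consistent with the assumed constraints.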
Single-cell 3D genome reconstruction in the haploid setting using rigidity theory. Sean Dewar, Georg Grasegger, Kaie Kubjas, Fatemeh Mohammadi, Anthony Nixon. arXiv - QuanBio - Genomics, 2024-07-15. https://doi.org/arxiv-2407.10700
The structures of RNA sequences play a vital role in various cellular processes, yet existing genomic foundation models (FMs) have struggled with precise sequence-structure alignment because the number of possible nucleotide base combinations grows exponentially. In this study, we introduce OmniGenome, a foundation model that addresses this critical challenge of sequence-structure alignment in RNA FMs. OmniGenome bridges sequences with secondary structures using structure-contextualized modeling, enabling hard in-silico genomic tasks that existing FMs cannot handle, e.g., RNA design tasks. The results on two comprehensive genomic benchmarks show that OmniGenome achieves state-of-the-art performance on complex RNA subtasks. For example, OmniGenome solved 74% of complex puzzles, compared to SpliceBERT, which solved only 3% of the puzzles. In addition, OmniGenome solves most of the puzzles within 1 hour, while the existing methods usually allocate 24 hours for each puzzle. Overall, OmniGenome establishes wide genomic application cases and offers profound insights into biological mechanisms from the perspective of sequence-structure alignment.
OmniGenome: Aligning RNA Sequences with Secondary Structures in Genomic Foundation Models. Heng Yang, Ke Li. arXiv - QuanBio - Genomics, 2024-07-15. https://doi.org/arxiv-2407.11242
Single-cell RNA sequencing (scRNA-seq) data analysis is crucial for biological research, as it enables the precise characterization of cellular heterogeneity. However, manual manipulation of various tools to achieve desired outcomes can be labor-intensive for researchers. To address this, we introduce CellAgent (http://cell.agent4science.cn/), an LLM-driven multi-agent framework, specifically designed for the automatic processing and execution of scRNA-seq data analysis tasks, providing high-quality results with no human intervention. Firstly, to adapt general LLMs to the biological field, CellAgent constructs LLM-driven biological expert roles - planner, executor, and evaluator - each with specific responsibilities. Then, CellAgent introduces a hierarchical decision-making mechanism to coordinate these biological experts, effectively driving the planning and step-by-step execution of complex data analysis tasks. Furthermore, we propose a self-iterative optimization mechanism, enabling CellAgent to autonomously evaluate and optimize solutions, thereby guaranteeing output quality. We evaluate CellAgent on a comprehensive benchmark dataset encompassing dozens of tissues and hundreds of distinct cell types. Evaluation results consistently show that CellAgent effectively identifies the most suitable tools and hyperparameters for single-cell analysis tasks, achieving optimal performance. This automated framework dramatically reduces the workload for science data analyses, bringing us into the "Agent for Science" era.
CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis. Yihang Xiao, Jinyi Liu, Yan Zheng, Xiaohan Xie, Jianye Hao, Mingzhi Li, Ruitao Wang, Fei Ni, Yuxiao Li, Jintian Luo, Shaoqing Jiao, Jiajie Peng. arXiv - QuanBio - Genomics, 2024-07-13. https://doi.org/arxiv-2407.09811
Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida
Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, which require substantial computational resources to train and are inefficient to deploy on the client side. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-side imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. Utilizing consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights by allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores when calculated using SNPs imputed from consumer gene panels, such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and lightweight alternative to reference-based imputation.
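The contrast between frequency-based substitution and regression-based imputation can be shown on a toy example. The sketch below is hedged throughout: the dosage vectors are invented, a single typed SNP is used as the only predictor, and the fit is evaluated on the same samples it was trained on, so it illustrates the mechanics rather than the FastImpute pipeline or its reported numbers.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented allele dosages (0, 1 or 2) for ten individuals.
typed_snp = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]     # present on the genotyping chip
untyped_snp = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]   # missing on the chip, in LD with typed_snp

# Baseline: give every individual the same expected dosage, which is what
# substituting an allele-frequency-based value amounts to.
mean_dosage = fmean(untyped_snp)
baseline_pred = [mean_dosage] * len(untyped_snp)

# Regression: predict the untyped dosage from the typed dosage.
slope, intercept = fit_line(typed_snp, untyped_snp)
regression_pred = [slope * x + intercept for x in typed_snp]

r2_baseline = r_squared(untyped_snp, baseline_pred)
r2_regression = r_squared(untyped_snp, regression_pred)
```

A constant prediction explains none of the variance, while even a one-predictor line recovers most of it when the two SNPs are correlated. A model this small is also what makes client-side deployment plausible: it is a handful of coefficients per imputed SNP.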
FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313. Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida. arXiv - QuanBio - Genomics, 2024-07-12. https://doi.org/arxiv-2407.09355
High-throughput sequencing technologies generate large quantities of DNA data and require advanced bioinformatics infrastructure for efficient data analysis. k-mer counting, the process of quantifying the frequency of DNA subsequences of fixed length k, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, scaling the counting process is critical. In the literature, distributed memory software uses hash tables, which exhibit poor cache friendliness and consume excessive memory. Such software often also lacks support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8x speedup, proving its flexibility and practicality in real-world scenarios.
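The core idea of sorting-based k-mer counting, as opposed to hash-table counting, can be sketched on a single machine: pack each k-mer into an integer, sort the integers, and count runs of equal values in one linear scan. The sketch below is only an illustration under stated assumptions; the function names (`encode_kmers`, `count_kmers_sorted`, `decode`) are hypothetical, it handles only the bases A, C, G and T with k at most 32, and it does not attempt HySortK's distributed communication scheme or task layer.

```python
import numpy as np

# 2-bit encoding of the four bases; any other character (e.g. "N")
# raises a KeyError in this simplified sketch.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(seq, k):
    """Pack every k-mer of seq into a 64-bit integer (requires k <= 32)."""
    codes = np.array([ENCODE[base] for base in seq], dtype=np.uint64)
    n = len(codes) - k + 1
    if n <= 0:
        return np.empty(0, dtype=np.uint64)
    kmers = np.zeros(n, dtype=np.uint64)
    for i in range(k):
        # Shift in the i-th base of every k-mer at once.
        kmers = (kmers << np.uint64(2)) | codes[i:i + n]
    return kmers


def count_kmers_sorted(seqs, k):
    """Count k-mers by sorting them and scanning for runs of equal values."""
    parts = [encode_kmers(seq, k) for seq in seqs]
    if not parts:
        return {}
    kmers = np.sort(np.concatenate(parts))
    if kmers.size == 0:
        return {}
    # After sorting, equal k-mers are adjacent; a run ends wherever
    # two neighbouring values differ.
    boundaries = np.flatnonzero(kmers[1:] != kmers[:-1]) + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [kmers.size]))
    return {int(kmers[s]): int(e - s) for s, e in zip(starts, ends)}


def decode(code, k):
    """Turn a packed integer back into its k-mer string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))


if __name__ == "__main__":
    counts = count_kmers_sorted(["ACGTACGT", "CGTA"], 3)
    for code, count in counts.items():
        print(decode(code, 3), count)
```

Sorting touches memory sequentially and needs no per-key bookkeeping, which is one reason a sort-and-scan design can be more cache-friendly than a hash table, the limitation the abstract attributes to existing distributed counters.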
High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism. Yifan Li, Giulia Guidi. arXiv - QuanBio - Genomics, 2024-07-10. https://doi.org/arxiv-2407.07718