Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta
Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, that can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate if it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.
{"title":"Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation","authors":"Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta","doi":"arxiv-2407.07265","DOIUrl":"https://doi.org/arxiv-2407.07265","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-07-09"}
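The pseudo-perplexity mentioned in the abstract above is commonly defined as the exponential of the mean negative log-probability that the model assigns to each true residue. The sketch below shows only that final arithmetic step, assuming the per-position probabilities are already available; how they are obtained (one masked forward pass per position, or the single-pass OFS estimate) is not reproduced here, and the function name and the example numbers are hypothetical.

```python
import math


def pseudo_perplexity(true_residue_probs):
    """Exponential of the mean negative log-probability over positions.

    true_residue_probs: one probability per sequence position, i.e. the
    probability the model assigned to the residue actually present there.
    """
    if not true_residue_probs:
        raise ValueError("need at least one position")
    mean_nll = -sum(math.log(p) for p in true_residue_probs) / len(true_residue_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities for a 4-residue sequence (illustration only).
example_probs = [0.5, 0.25, 0.5, 0.25]
example_ppl = pseudo_perplexity(example_probs)

# A model that is uniform over the 20 standard amino acids gives a value of 20.
uniform_ppl = pseudo_perplexity([0.05] * 10)
```

Lower values mean the model is more confident about the residues it sees, which is why the quantity can double as a fitness proxy.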
Mammalian gut microbiomes are essential for host functions like digestion, immunity, and nutrient utilization. This study examines the gut microbiome of horses, donkeys, and their hybrids, mules and hinnies, to explore the role of microbiomes in hybrid vigor. We performed whole-genome sequencing on rectal microbiota from 18 equids, generating detailed microbiome assemblies. Our analysis revealed significant differences between horse and donkey microbiomes, with hybrids showing a pronounced maternal resemblance. Notably, Firmicutes were more abundant in the horse-maternal group, while Fibrobacteres were richer in the donkey-maternal group, indicating distinct digestive processes. Functional annotations indicated metabolic differences, such as protein synthesis in horses and energy metabolism in donkeys. Machine learning predictions of probiotic species highlighted potential health benefits for each maternal group. This study provides a high-resolution view of the equid gut microbiome, revealing significant taxonomic and metabolic differences influenced by maternal lineage, and offers insights into microbial contributions to hybrid vigor.
{"title":"Metagenomic analysis reveals shared and distinguishing features in horse and donkey gut microbiome and maternal resemblance of the microbiota in hybrid equids","authors":"Yihang Zhou","doi":"arxiv-2407.05076","DOIUrl":"https://doi.org/arxiv-2407.05076","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-07-06"}
Fescue toxicity causes reduced growth and reproductive issues in cattle grazing endophyte-infected tall fescue. To characterize the gut microbiota and its response to fescue toxicosis, we collected fecal samples before and after a 30-day toxic fescue seed supplementation from eight pregnant Angus Simmental cows and heifers. We sequenced the 16 metagenomes using the whole-genome shotgun approach and generated 157 Gbp of metagenomic sequences. Through de novo assembly and annotation, we obtained a 13.1 Gbp reference contig assembly and identified 22 million microbial genes for cattle rectum microbiota. We discovered a significant reduction of microbial diversity after toxic seed treatment (P<0.01), suggesting dysbiosis of the microbiome. Six bacterial families and 31 species were significantly increased in the fecal microbiota (P-adj<0.05), including members of the most abundant rumen core taxa. This global elevation of rumen microbes in the rectum microbiota suggests a potential impairment of rumen microbiota under fescue toxicosis. Among these, Ruminococcaceae bacterium P7, an important species accounting for ~2% of rumen microbiota, was the most impacted, with a 16-fold increase from 0.17% to 2.8% in feces (P<0.01). We hypothesize that rumen Ruminococcaceae bacterium P7 re-adapted to the large intestine environment under toxic fescue stress, causing this dramatic increase in abundance. Functional enrichment analysis revealed that the overrepresented pathways shifted from energy metabolism to antimicrobial resistance and DNA replication. In conclusion, we discovered dramatic microbiota alterations in composition, abundance, and functional capacities under fescue toxicosis, and our results suggest Ruminococcaceae bacterium P7 as a potential biomarker for fescue toxicosis management.
{"title":"Metagenomic analysis revealed significant changes in cattle rectum microbiome and antimicrobial resistome under fescue toxicosis","authors":"Yihang Zhou","doi":"arxiv-2407.05055","DOIUrl":"https://doi.org/arxiv-2407.05055","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-07-06"}
DNA sequences encode vital genetic and biological information, yet these variable-length sequences cannot serve as the input of common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes face difficulties in learning high-quality representations due to the complexity and sparsity of DNA data. Additionally, DNA sequences are inherently noisy because of mutations. While several effective schemes have been proposed, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify the traditional sparse recovery to capture recurring patterns indicative of biological functions by representing frequent K-mers as basis vectors and reconstructing each DNA sequence through simple concatenation. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
{"title":"Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery","authors":"Zhiyuan Peng, Yuanbo Tang, Yang Li","doi":"arxiv-2407.12051","DOIUrl":"https://doi.org/arxiv-2407.12051","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-07-06"}
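To make the idea of using frequent K-mers as a basis for a fixed-length representation more concrete, here is a deliberately simplified sketch. It is not the Dy-mer algorithm: it skips sparse recovery and concatenation-based reconstruction entirely and only counts how often each dictionary K-mer occurs in a sequence. The function names, the toy sequences, and the choice of k are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_dictionary(seqs, k, size):
    """Pick the `size` most frequent K-mers across all sequences."""
    counts = Counter(km for s in seqs for km in kmers(s, k))
    return [km for km, _ in counts.most_common(size)]


def encode(seq, dictionary, k):
    """Fixed-length vector: occurrence count of each dictionary K-mer."""
    counts = Counter(kmers(seq, k))
    return [counts[km] for km in dictionary]


# Toy data, for illustration only.
toy_seqs = ["ATATAT", "ATATGC"]
toy_dictionary = build_dictionary(toy_seqs, k=2, size=2)
toy_vectors = [encode(s, toy_dictionary, k=2) for s in toy_seqs]
```

Every sequence maps to a vector whose length equals the dictionary size, regardless of the sequence's own length, and each coordinate can be read back as a named K-mer.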
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms. Therefore, interpreting these models may provide novel insights into the underlying biology, supporting downstream biomedical applications. Because these models are complex, interpretable surrogate models can only be built for local explanations (e.g., a single instance). However, accomplishing this requires generating a dataset in the neighborhood of the input, which must maintain syntactic similarity to the original data while introducing semantic variability in the model's predictions. This task is challenging due to the complex sequence-to-function relationship of DNA. We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity. Our custom, domain-guided individual representation effectively constrains syntactic similarity, and we provide two alternative fitness functions that promote diversity with no computational effort. Applied to the RNA splicing domain, our approach quickly achieves good diversity and significantly outperforms a random baseline in exploring the search space, as shown on a short proof-of-concept RNA sequence. Furthermore, we assess its generalizability and demonstrate scalability to larger sequences, resulting in an approximately 30% improvement over the baseline.
{"title":"Semantically Rich Local Dataset Generation for Explainable AI in Genomics","authors":"Pedro Barbosa, Rosina Savisaar, Alcides Fonseca","doi":"arxiv-2407.02984","DOIUrl":"https://doi.org/arxiv-2407.02984","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-07-03"}
RNA sequencing techniques, like bulk RNA-seq and single-cell (sc) RNA-seq, are critical tools for the biologist looking to analyze the genetic activity/transcriptome of a tissue or cell during an experimental procedure. Platforms like Illumina's next-generation sequencing (NGS) are used to produce the raw data for this experimental procedure. This raw FASTQ data must then be prepared via a complex series of data manipulations by bioinformaticians. This process currently takes place on an unwieldy textual user interface like a terminal/command line that requires the user to install and import multiple program packages, preventing the untrained biologist from initiating data analysis. Open-source platforms like Galaxy have produced a more user-friendly pipeline, yet the visual interface remains cluttered, highly technical, and uninviting for the natural scientist. To address this, SeqMate is a user-friendly tool that allows for one-click analytics by utilizing the power of a large language model (LLM) to automate both data preparation and analysis (differential expression, trajectory analysis, etc.). Furthermore, by utilizing the power of generative AI, SeqMate is also capable of analyzing such findings and producing written reports of upregulated/downregulated/user-prompted genes with sources cited from known repositories like PubMed, PDB, and UniProt.
{"title":"SeqMate: A Novel Large Language Model Pipeline for Automating RNA Sequencing","authors":"Devam Mondal, Atharva Inamdar","doi":"arxiv-2407.03381","DOIUrl":"https://doi.org/arxiv-2407.03381","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-07-02"}
This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.
{"title":"CGRclust: Chaos Game Representation for Twin Contrastive Clustering of Unlabelled DNA Sequences","authors":"Fatemeh Alipour, Kathleen A. Hill, Lila Kari","doi":"arxiv-2407.02538","DOIUrl":"https://doi.org/arxiv-2407.02538","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-07-01"}
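For readers unfamiliar with Chaos Game Representation, the following is a minimal sketch of the classic construction: start at the centre of the unit square and, for each nucleotide, move halfway toward that nucleotide's corner. The corner assignment below is one common convention and is an assumption here, as are the function names; the abstract does not state which layout or image resolution CGRclust uses, and the clustering and CNN stages are not shown.

```python
# Assumed corner layout; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Chaos-game trajectory of a DNA sequence inside the unit square."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]  # raises KeyError on non-ACGT symbols
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_image(seq, k):
    """Count trajectory points in a 2^k x 2^k grid (rows index y, columns index x)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


# Toy example: each of the four bases lands in its own quadrant.
toy_grid = cgr_image("ACGT", k=1)
```

Because the grid size depends only on `k`, sequences of very different lengths all map to images of the same shape, which is what makes them usable as input to an image model.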
Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu
Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7$\times$-37.2$\times$ and 6.9$\times$-100.2$\times$, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5$\times$-5.1$\times$ speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.
{"title":"MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing","authors":"Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu","doi":"arxiv-2406.19113","DOIUrl":"https://doi.org/arxiv-2406.19113","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-27"}
Due to the sequential sample arrival, changing experiment conditions, and evolution of knowledge, the demand to continually visualize evolving structures of sequential and diverse single-cell RNA-sequencing (scRNA-seq) data becomes indispensable. However, as one of the state-of-the-art visualization and analysis methods for scRNA-seq, t-distributed stochastic neighbor embedding (t-SNE) merely visualizes static scRNA-seq data offline and fails to meet the demand well. To address these challenges, we introduce online t-SNE to seamlessly integrate sequential scRNA-seq data. Online t-SNE achieves this by leveraging the embedding space of old samples, exploring the embedding space of new samples, and aligning the two embedding spaces on the fly. Consequently, online t-SNE dramatically enables the continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch. We showcase the formidable visualization capabilities of online t-SNE across diverse sequential scRNA-seq datasets.
{"title":"Online t-SNE for single-cell RNA-seq","authors":"Hui Ma, Kai Chen","doi":"arxiv-2406.14842","DOIUrl":"https://doi.org/arxiv-2406.14842","url":null,"abstract":"Due to the sequential sample arrival, changing experiment conditions, and evolution of knowledge, the demand to continually visualize evolving structures of sequential and diverse single-cell RNA-sequencing (scRNA-seq) data becomes indispensable. However, as one of the state-of-the-art visualization and analysis methods for scRNA-seq, t-distributed stochastic neighbor embedding (t-SNE) merely visualizes static scRNA-seq data offline and fails to meet the demand well. To address these challenges, we introduce online t-SNE to seamlessly integrate sequential scRNA-seq data. Online t-SNE achieves this by leveraging the embedding space of old samples, exploring the embedding space of new samples, and aligning the two embedding spaces on the fly. Consequently, online t-SNE dramatically enables the continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch. We showcase the formidable visualization capabilities of online t-SNE across diverse sequential scRNA-seq datasets.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at https://github.com/Liu-Hy/GenoTex.
{"title":"GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians","authors":"Haoyang Liu, Haohan Wang","doi":"arxiv-2406.15341","DOIUrl":"https://doi.org/arxiv-2406.15341","url":null,"abstract":"Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at https://github.com/Liu-Hy/GenoTex.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}