arXiv - QuanBio - Genomics: Latest Papers, Page 5

Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation

arXiv - QuanBio - Genomics

Pub Date : 2024-07-09 DOI: arxiv-2407.07265

Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta

Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high-dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, that can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate if it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.
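To make the pseudo-perplexity quantity in this abstract concrete, here is a minimal sketch under stated assumptions. It uses the commonly cited definition (the exponential of the negative mean log-probability that the model assigns to each true residue) and takes those per-position probabilities as given. The function name `pseudo_perplexity` and its input format are hypothetical; the sketch does not reproduce the authors' OFS estimator or call any ESM2 API.

```python
import math


def pseudo_perplexity(true_residue_probs):
    """Return exp(-mean(log p_i)) for a list of per-position probabilities.

    true_residue_probs[i] is assumed to be the probability the model assigns
    to the actual residue at position i (hypothetical input; obtaining it
    from a real model is outside this sketch).
    """
    if not true_residue_probs:
        raise ValueError("need at least one position")
    if any(p <= 0.0 or p > 1.0 for p in true_residue_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_log_prob = sum(math.log(p) for p in true_residue_probs) / len(true_residue_probs)
    return math.exp(-mean_log_prob)


if __name__ == "__main__":
    # A model that is uniformly unsure over 20 amino acids at every position.
    print(pseudo_perplexity([1 / 20] * 50))
```

Lower values indicate that the model finds the sequence less surprising, which is why the quantity can be read as a fitness proxy.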


Citations: 0

Metagenomic analysis reveals shared and distinguishing features in horse and donkey gut microbiome and maternal resemblance of the microbiota in hybrid equids

arXiv - QuanBio - Genomics

Pub Date : 2024-07-06 DOI: arxiv-2407.05076

Yihang Zhou

Mammalian gut microbiomes are essential for host functions like digestion, immunity, and nutrient utilization. This study examines the gut microbiome of horses, donkeys, and their hybrids, mules and hinnies, to explore the role of microbiomes in hybrid vigor. We performed whole-genome sequencing on rectal microbiota from 18 equids, generating detailed microbiome assemblies. Our analysis revealed significant differences between horse and donkey microbiomes, with hybrids showing a pronounced maternal resemblance. Notably, Firmicutes were more abundant in the horse-maternal group, while Fibrobacteres were richer in the donkey-maternal group, indicating distinct digestive processes. Functional annotations indicated metabolic differences, such as protein synthesis in horses and energy metabolism in donkeys. Machine learning predictions of probiotic species highlighted potential health benefits for each maternal group. This study provides a high-resolution view of the equid gut microbiome, revealing significant taxonomic and metabolic differences influenced by maternal lineage, and offers insights into microbial contributions to hybrid vigor.


Citations: 0

Metagenomic analysis revealed significant changes in cattle rectum microbiome and antimicrobial resistome under fescue toxicosis

arXiv - QuanBio - Genomics

Pub Date : 2024-07-06 DOI: arxiv-2407.05055

Yihang Zhou

Fescue toxicity causes reduced growth and reproductive issues in cattle grazing endophyte-infected tall fescue. To characterize the gut microbiota and its response to fescue toxicosis, we collected fecal samples before and after a 30-days toxic fescue seeds supplementation from eight Angus Simmental pregnant cows and heifers. We sequenced the 16 metagenomes using the whole-genome shotgun approach and generated 157 Gbp of metagenomic sequences. Through de novo assembly and annotation, we obtained a 13.1 Gbp reference contig assembly and identified 22 million microbial genes for cattle rectum microbiota. We discovered a significant reduction of microbial diversity after toxic seed treatment (P<0.01), suggesting dysbiosis of the microbiome. Six bacterial families and 31 species are significantly increased in the fecal microbiota (P-adj<0.05), including members of the top abundant rumen core taxa. This global elevation of rumen microbes in the rectum microbiota suggests a potential impairment of rumen microbiota under fescue toxicosis. Among these, Ruminococcaceae bacterium P7, an important species accounting for ~2% of rumen microbiota, was the most impacted with a 16-fold increase from 0.17% to 2.8% in feces (P<0.01). We hypothesized that rumen Ruminococcaceae bacterium P7 re-adapted to the large intestine environment under toxic fescue stress, causing this dramatic increase in abundance. Functional enrichment analysis revealed that the overrepresented pathways shifted from energy metabolism to antimicrobial resistance and DNA replication. In conclusion, we discovered dramatic microbiota alterations in composition, abundance, and functional capacities under fescue toxicosis, and our results suggest Ruminococcaceae bacterium P7 as a potential biomarker for fescue toxicosis management.
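As a quick arithmetic check on the reported 16-fold increase, the sketch below computes the fold change between the two relative abundances quoted in the abstract (0.17% and 2.8%). The helper name `fold_change` is hypothetical, and the log2 value is included only because it is a common way to report such ratios, not because the abstract uses it.

```python
import math


def fold_change(before, after):
    """Ratio of the later value to the earlier one (both must be positive)."""
    if before <= 0 or after <= 0:
        raise ValueError("abundances must be positive")
    return after / before


if __name__ == "__main__":
    fc = fold_change(0.17, 2.8)  # relative abundances in percent, from the abstract
    print(f"fold change: {fc:.2f}")
    print(f"log2 fold change: {math.log2(fc):.2f}")
```

The ratio comes out slightly above 16, which is consistent with the rounded "16-fold" figure in the text.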


Citations: 0

Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery

arXiv - QuanBio - Genomics

Pub Date : 2024-07-06 DOI: arxiv-2407.12051

Zhiyuan Peng, Yuanbo Tang, Yang Li

DNA sequences encode vital genetic and biological information, yet these unfixed-length sequences cannot serve as the input of common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes face difficulties in learning high-quality representations due to the complexity and sparsity of DNA data. Additionally, DNA sequences are inherently noisy because of mutations. While several schemes have been proposed for their effectiveness, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify the traditional sparse recovery to capture recurring patterns indicative of biological functions by representing frequent K-mers as basis vectors and reconstructing each DNA sequence through simple concatenation. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
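The abstract builds its representation on frequent K-mers. As an illustration of that preliminary step only, the following sketch counts overlapping k-mers across a set of sequences and returns the most frequent ones. The function name `top_kmers` and its parameters are hypothetical, and nothing here implements the sparse-recovery formulation of Dy-mer itself.

```python
from collections import Counter


def top_kmers(sequences, k, n):
    """Count overlapping k-mers over all sequences and return the n most common.

    Returns a list of (kmer, count) pairs. Sequences shorter than k
    contribute nothing.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Toy input: "ACG" appears twice in the first sequence and once in the second.
    print(top_kmers(["ACGACG", "TTACGT"], k=3, n=2))
```

A frequent-k-mer vocabulary of this kind could then play the role of a candidate set of basis vectors, though how the paper selects and weights them is not stated in the abstract.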


Citations: 0

Semantically Rich Local Dataset Generation for Explainable AI in Genomics

arXiv - QuanBio - Genomics

Pub Date : 2024-07-03 DOI: arxiv-2407.02984

Pedro Barbosa, Rosina Savisaar, Alcides Fonseca

Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms. Therefore, interpreting these models may provide novel insights into the underlying biology, supporting downstream biomedical applications. Due to their complexity, interpretable surrogate models can only be built for local explanations (e.g., a single instance). However, accomplishing this requires generating a dataset in the neighborhood of the input, which must maintain syntactic similarity to the original data while introducing semantic variability in the model's predictions. This task is challenging due to the complex sequence-to-function relationship of DNA. We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity. Our custom, domain-guided individual representation effectively constrains syntactic similarity, and we provide two alternative fitness functions that promote diversity with no computational effort. Applied to the RNA splicing domain, our approach quickly achieves good diversity and significantly outperforms a random baseline in exploring the search space, as shown by our proof-of-concept, short RNA sequence. Furthermore, we assess its generalizability and demonstrate scalability to larger sequences, resulting in a ≈30% improvement over the baseline.


Citations: 0

SeqMate: A Novel Large Language Model Pipeline for Automating RNA Sequencing

arXiv - QuanBio - Genomics

Pub Date : 2024-07-02 DOI: arxiv-2407.03381

Devam Mondal, Atharva Inamdar

RNA sequencing techniques, like bulk RNA-seq and Single Cell (sc) RNA-seq, are critical tools for the biologist looking to analyze the genetic activity/transcriptome of a tissue or cell during an experimental procedure. Platforms like Illumina's next-generation sequencing (NGS) are used to produce the raw data for this experimental procedure. This raw FASTQ data must then be prepared via a complex series of data manipulations by bioinformaticians. This process currently takes place on an unwieldy textual user interface like a terminal/command line that requires the user to install and import multiple program packages, preventing the untrained biologist from initiating data analysis. Open-source platforms like Galaxy have produced a more user-friendly pipeline, yet the visual interface remains cluttered and highly technical, remaining uninviting for the natural scientist. To address this, SeqMate is a user-friendly tool that allows for one-click analytics by utilizing the power of a large language model (LLM) to automate both data preparation and analysis (differential expression, trajectory analysis, etc). Furthermore, by utilizing the power of generative AI, SeqMate is also capable of analyzing such findings and producing written reports of upregulated/downregulated/user-prompted genes with sources cited from known repositories like PubMed, PDB, and Uniprot.


Citations: 0

CGRclust: Chaos Game Representation for Twin Contrastive Clustering of Unlabelled DNA Sequences

arXiv - QuanBio - Genomics

Pub Date : 2024-07-01 DOI: arxiv-2407.02538

Fatemeh Alipour, Kathleen A. Hill, Lila Kari

This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.
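For readers unfamiliar with Chaos Game Representation, the sketch below shows the basic iteration under one common convention, which is an assumption here: the unit square has corners A (0, 0), C (0, 1), G (1, 1), T (1, 0), the walk starts at the centre, and each nucleotide moves the current point halfway toward its corner. The names `CORNERS` and `cgr_points` are hypothetical, and this is not the CGRclust pipeline, which additionally turns such points into images for a CNN.

```python
# Corner layout is an assumed convention; other papers permute the corners.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the list of CGR points visited for a DNA sequence.

    Characters other than A, C, G, T (for example N) are skipped.
    """
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue
        # Move halfway from the current point toward the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

Binning these points on a 2^k by 2^k grid gives a k-mer frequency image, which is the kind of two-dimensional input an image classifier can consume.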


Citations: 0

MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing

arXiv - QuanBio - Genomics

Pub Date : 2024-06-27 DOI: arxiv-2406.19113

Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu

Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7×-37.2× and 6.9×-100.2×, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5×-5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.
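
The abstract names two core tasks: determining which species are present and estimating their relative abundances. The sketch below illustrates only the second task in its simplest form and is not part of MegIS. It assumes a hypothetical list of per-read species assignments (with `None` for unclassified reads) and reports each species' share of the classified reads; real pipelines typically apply further corrections that are omitted here.

```python
from collections import Counter

def relative_abundances(read_assignments):
    """Return each species' fraction of the classified reads.

    `read_assignments` is a hypothetical input: one species label per read,
    or None when the read could not be classified.
    """
    # Count classified reads per species, skipping unclassified reads.
    counts = Counter(a for a in read_assignments if a is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: n / total for species, n in counts.items()}

# Toy input: three reads assigned to one species, one to another, one unclassified.
reads = ["E. coli", "E. coli", "B. subtilis", None, "E. coli"]
abund = relative_abundances(reads)
print(abund)  # {'E. coli': 0.75, 'B. subtilis': 0.25}
```

Unclassified reads are excluded from the denominator here; including them would be an equally defensible convention.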

Citations: 0

Online t-SNE for single-cell RNA-seq

arXiv - QuanBio - Genomics

Pub Date : 2024-06-21 DOI: arxiv-2406.14842

Hui Ma, Kai Chen

Due to the sequential sample arrival, changing experiment conditions, and evolution of knowledge, the demand to continually visualize evolving structures of sequential and diverse single-cell RNA-sequencing (scRNA-seq) data becomes indispensable. However, as one of the state-of-the-art visualization and analysis methods for scRNA-seq, t-distributed stochastic neighbor embedding (t-SNE) merely visualizes static scRNA-seq data offline and fails to meet the demand well. To address these challenges, we introduce online t-SNE to seamlessly integrate sequential scRNA-seq data. Online t-SNE achieves this by leveraging the embedding space of old samples, exploring the embedding space of new samples, and aligning the two embedding spaces on the fly. Consequently, online t-SNE dramatically enables the continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch. We showcase the formidable visualization capabilities of online t-SNE across diverse sequential scRNA-seq datasets.
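
The abstract does not spell out how old and new embedding spaces are combined, so the following is only a generic illustration of reusing an existing embedding, not the authors' algorithm. It assumes old samples `X_old` with fixed 2-D coordinates `Y_old`, and places each new sample at the inverse-distance-weighted average of the coordinates of its `k` nearest old samples. The function name, `k`, and the weighting scheme are all assumptions made for this sketch.

```python
import numpy as np

def place_new_points(X_old, Y_old, X_new, k=5, eps=1e-12):
    """Initialise low-dimensional positions for new samples from old neighbours."""
    # Pairwise squared Euclidean distances, shape (n_new, n_old).
    d2 = ((X_new[:, None, :] - X_old[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k nearest old samples for each new sample.
    nn = np.argsort(d2, axis=1)[:, :k]
    nn_d2 = np.take_along_axis(d2, nn, axis=1)
    # Inverse-distance weights, normalised to sum to 1 per new sample.
    w = 1.0 / (nn_d2 + eps)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted average of the neighbours' existing embedding coordinates.
    return (w[:, :, None] * Y_old[nn]).sum(axis=1)

rng = np.random.default_rng(0)
X_old = rng.normal(size=(100, 20))   # toy "expression" matrix for old samples
Y_old = rng.normal(size=(100, 2))    # stand-in for an existing 2-D embedding
X_new = rng.normal(size=(10, 20))    # newly arrived samples
Y_new = place_new_points(X_old, Y_old, X_new)
print(Y_new.shape)  # (10, 2)
```

Positions obtained this way would normally serve only as a starting point. A method that also discovers new structure, as the abstract describes, has to refine them further, and that step is not shown.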

Citations: 0

GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians

arXiv - QuanBio - Genomics

Pub Date : 2024-06-21 DOI: arxiv-2406.15341

Haoyang Liu, Haohan Wang

Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at https://github.com/Liu-Hy/GenoTex.

Citations: 0