arXiv - QuanBio - Genomics: latest papers, page 8

LangCell: Language-Cell Pre-training for Cell Identity Understanding

arXiv - QuanBio - Genomics

Pub Date : 2024-05-09 DOI: arxiv-2405.06708

Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, Yushuai Wu, Zaiqing Nie

Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, disease information, and more, which are essential for biologists to gain insights into its biological characteristics. Understanding cell identity from transcriptomic data, such as annotating cell types, has become an important task in bioinformatics. As these semantic aspects are determined by human experts, it is impossible for AI models to effectively carry out cell identity understanding tasks without the supervision signals provided by single-cell and label pairs. The single-cell pre-trained language models (PLMs) currently used for this task are trained only on a single modality, transcriptomics data, and lack an understanding of cell identity knowledge. As a result, they have to be fine-tuned for downstream tasks and struggle when lacking labeled data with the desired semantic labels. To address this issue, we propose an innovative solution by constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce LangCell, the first Language-Cell pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios.
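To make the zero-shot setting concrete, here is a minimal sketch of how a shared cell/text embedding space could be used to annotate a cell without labeled training data. It assumes that a cell encoder and a text encoder already map their inputs into the same vector space and that the label is chosen by cosine similarity; the abstract does not state that LangCell works exactly this way, and the function names and toy vectors below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_label(cell_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the cell embedding."""
    return max(
        text_embeddings,
        key=lambda label: cosine(cell_embedding, text_embeddings[label]),
    )

# Hand-made toy vectors standing in for encoder outputs (hypothetical values).
text_embeddings = {
    "T cell": [0.9, 0.1, 0.0],
    "B cell": [0.1, 0.9, 0.0],
    "monocyte": [0.0, 0.2, 0.9],
}
cell = [0.8, 0.3, 0.1]
print(zero_shot_label(cell, text_embeddings))
```

Because the candidate labels are supplied as text at inference time, new cell types can be added to the dictionary without retraining, which is the practical appeal of the zero-shot scenario described above.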



Citations: 0

Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity

arXiv - QuanBio - Genomics

Pub Date : 2024-05-09 DOI: arxiv-2405.05998

Zhufeng Li, Sandeep S Cranganore, Nicholas Youngblut, Niki Kilbertus

Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats. We not only demonstrate solid predictive performance, but also how sequence-level information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow up.


Citations: 0

On the Coverage Required for Diploid Genome Assembly

arXiv - QuanBio - Genomics

Pub Date : 2024-05-09 DOI: arxiv-2405.05734

Daanish Mahajan, Chirag Jain, Navin Kashyap

We investigate the information-theoretic conditions to achieve the complete reconstruction of a diploid genome. We also analyze the standard greedy and de-Bruijn graph-based algorithms and compare the coverage depth and read length requirements with the information-theoretic lower bound. Our results show that the gap between the two is considerable because both algorithms require the double repeats in the genome to be bridged.
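The quantities named in this abstract (coverage depth, read length, bridging a repeat) can be illustrated with a small calculation. The sketch below assumes an idealized haploid model in which read start positions are uniform and approximately Poisson; it is not the paper's diploid analysis, and the function names are hypothetical. A read of length L bridges a repeat of length r only if it covers the repeat plus at least one flanking base on each side, which leaves L - r - 1 admissible start positions.

```python
import math

def coverage_depth(num_reads, read_length, genome_length):
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_length

def prob_repeat_bridged(num_reads, read_length, genome_length, repeat_length):
    """Probability that at least one read spans a given repeat copy.

    Assumes read starts are Poisson with rate N / G per position
    (an idealization, not the paper's model).
    """
    window = read_length - repeat_length - 1  # admissible start positions
    if window <= 0:
        return 0.0  # reads are too short to bridge the repeat at any depth
    rate = num_reads / genome_length
    return 1.0 - math.exp(-rate * window)

print(coverage_depth(1000, 100, 10000))           # coverage of 10x
print(prob_repeat_bridged(1000, 100, 10000, 49))  # repeat shorter than reads
print(prob_repeat_bridged(1000, 100, 10000, 100)) # repeat as long as reads
```

The last call shows why read length matters independently of depth: when the repeat is at least as long as the reads, no amount of coverage bridges it under this model.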


Citations: 0

The Canadian VirusSeq Data Portal & Duotang: open resources for SARS-CoV-2 viral sequences and genomic epidemiology

arXiv - QuanBio - Genomics

Pub Date : 2024-05-08 DOI: arxiv-2405.04734

Erin E. Gill, Baofeng Jia, Carmen Lia Murall, Raphaël Poujol, Muhammad Zohaib Anwar, Nithu Sara John, Justin Richardsson, Ashley Hobb, Abayomi S. Olabode, Alexandru Lepsa, Ana T. Duggan, Andrea D. Tyler, Arnaud N'Guessan, Atul Kachru, Brandon Chan, Catherine Yoshida, Christina K. Yung, David Bujold, Dusan Andric, Edmund Su, Emma J. Griffiths, Gary Van Domselaar, Gordon W. Jolly, Heather K. E. Ward, Henrich Feher, Jared Baker, Jared T. Simpson, Jaser Uddin, Jiannis Ragoussis, Jon Eubank, Jörg H. Fritz, José Héctor Gálvez, Karen Fang, Kim Cullion, Leonardo Rivera, Linda Xiang, Matthew A. Croxen, Mitchell Shiell, Natalie Prystajecky, Pierre-Olivier Quirion, Rosita Bajari, Samantha Rich, Samira Mubareka, Sandrine Moreira, Scott Cain, Steven G. Sutcliffe, Susanne A. Kraemer, Yann Joly, Yelizar Alturmessov, CPHLN consortium, CanCOGeN consortium, VirusSeq Data Portal Academic, Health network, Marc Fiume, Terrance P. Snutch, Cindy Bell, Catalina Lopez-Correa, Julie G. Hussin, Jeffrey B. Joy, Caroline Colijn, Paul M. K. Gordon, William W. L. Hsiao, Art F. Y. Poon, Natalie C. Knox, Mélanie Courtot, Lincoln Stein, Sarah P. Otto, Guillaume Bourque, B. Jesse Shapiro, Fiona S. L. Brinkman

The COVID-19 pandemic led to a large global effort to sequence SARS-CoV-2 genomes from patient samples to track viral evolution and inform public health response. Millions of SARS-CoV-2 genome sequences have been deposited in global public repositories. The Canadian COVID-19 Genomics Network (CanCOGeN - VirusSeq), a consortium tasked with coordinating expanded sequencing of SARS-CoV-2 genomes across Canada early in the pandemic, created the Canadian VirusSeq Data Portal, with associated data pipelines and procedures, to support these efforts. The goal of VirusSeq was to allow open access to Canadian SARS-CoV-2 genomic sequences and enhanced, standardized contextual data that were unavailable in other repositories and that meet FAIR standards (Findable, Accessible, Interoperable and Reusable). The Portal data submission pipeline contains data quality checking procedures and appropriate acknowledgement of data generators that encourages collaboration. Here we also highlight Duotang, a web platform that presents genomic epidemiology and modeling analyses on circulating and emerging SARS-CoV-2 variants in Canada. Duotang presents dynamic changes in variant composition of SARS-CoV-2 in Canada and by province, estimates variant growth, and displays complementary interactive visualizations, with a text overview of the current situation. The VirusSeq Data Portal and Duotang resources, alongside additional analyses and resources computed from the Portal (COVID-MVP, CoVizu), are all open-source and freely available. Together, they provide an updated picture of SARS-CoV-2 evolution to spur scientific discussions, inform public discourse, and support communication with and within public health authorities. They also serve as a framework for other jurisdictions interested in open, collaborative sequence data sharing and analyses.



Citations: 0

sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures

arXiv - QuanBio - Genomics

Pub Date : 2024-05-06 DOI: arxiv-2405.03726

Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan

Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models show successful performance in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen if a simpler model could achieve comparable or better results, especially with limited data. This is important, as the quantity and quality of single-cell data typically fall short of the standards in textual data used for training LLMs. Single-cell sequencing often suffers from technical artifacts, dropout events, and batch effects. These challenges are compounded in a weakly supervised setting, where the labels of cell states can be noisy, further complicating the analysis. To tackle these challenges, we present sc-OTGM, streamlined with fewer than 500K parameters, making it approximately 100x more compact than the foundation models and offering an efficient alternative. sc-OTGM is an unsupervised model grounded in the inductive bias that scRNA-seq data can be generated from a combination of finite multivariate Gaussian distributions. The core function of sc-OTGM is to create a probabilistic latent space utilizing a GMM as its prior distribution and distinguish between distinct cell populations by learning their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to determine the OT plan across these PDFs within the GMM framework. We evaluated our model against a CRISPR-mediated perturbation dataset, called CROP-seq, consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification through a recommender system. It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
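As a rough illustration of what optimal transport between Gaussian components measures, the sketch below computes the squared 2-Wasserstein distance between two Gaussians with diagonal covariance, for which a closed form exists: the squared distance between the means plus the squared distance between the per-dimension standard deviations. This is a swapped-in, simpler technique chosen for illustration; the abstract says sc-OTGM determines the OT plan with a Hit-and-Run Markov chain sampler, which is not reproduced here. The function name and toy values are hypothetical.

```python
def w2_squared_diag(mean_a, std_a, mean_b, std_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    Valid only when both covariance matrices are diagonal
    (an assumption made for this sketch).
    """
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    spread_term = sum((sa - sb) ** 2 for sa, sb in zip(std_a, std_b))
    return mean_term + spread_term

# Two toy cell-state components in a 2-D latent space (hypothetical values).
control = {"mean": [0.0, 0.0], "std": [1.0, 1.0]}
perturbed = {"mean": [3.0, 4.0], "std": [1.0, 1.0]}

print(w2_squared_diag(control["mean"], control["std"],
                      perturbed["mean"], perturbed["std"]))
```

With equal spreads, the distance reduces to the shift between means, which matches the intuition that a perturbation moving a cell population without reshaping it costs only the displacement.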



Citations: 0

A Multi-Domain Multi-Task Approach for Feature Selection from Bulk RNA Datasets

arXiv - QuanBio - Genomics

Pub Date : 2024-05-04 DOI: arxiv-2405.02534

Karim Salta, Tomojit Ghosh, Michael Kirby

In this paper, a multi-domain multi-task algorithm for feature selection in bulk RNAseq data is proposed. Two datasets arising from the mouse host immune response to Salmonella infection are investigated. Data is collected from several strains of collaborative cross mice. Samples from the spleen and liver serve as the two domains. Several machine learning experiments are conducted, and in each case a small subset of features that are discriminative across domains is extracted. The algorithm proves viable and underlines the benefits of across-domain feature selection by extracting a new subset of discriminative features which could not be extracted by a one-domain approach alone.


Citations: 0

Identification of SNPs in genomes using GRAMEP, an alignment-free method based on the Principle of Maximum Entropy

arXiv - QuanBio - Genomics

Pub Date : 2024-05-02 DOI: arxiv-2405.01715

Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes

Advances in high throughput sequencing technologies provide a large number of genomes to be analyzed, so computational methodologies play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. Sequence alignment is commonly used to analyze genomic variation; however, this approach can be computationally expensive and potentially arbitrary in scenarios with large datasets. Here, we present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This method uses the principle of maximum entropy to select the most informative k-mers specific to the variant under investigation. The use of this informative k-mer set enables the detection of variant-specific mutations in comparison to a reference sequence. In addition, our method offers the possibility of classifying novel sequences with no need for organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of real viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification while demonstrating a lower computational cost compared to the gold-standard statistical tools. The source code for this proof-of-concept implementation is freely available at https://github.com/omatheuspimenta/GRAMEP.
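The alignment-free idea can be sketched with plain k-mer counting: k-mers that occur in a variant sequence but not in the reference mark the neighborhood of a mutation. The example below is only an illustration of that idea on toy strings; it does not implement GRAMEP's maximum-entropy selection of informative k-mers, and the function names are hypothetical rather than taken from the GRAMEP code base.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping substring of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def variant_specific_kmers(reference, variant, k):
    """Return the sorted k-mers present in the variant but absent from the reference."""
    ref = kmer_counts(reference, k)
    var = kmer_counts(variant, k)
    return sorted(kmer for kmer in var if kmer not in ref)

# Toy sequences differing by one substitution (G -> A at index 7).
reference = "ACGTACGGTCA"
variant = "ACGTACGATCA"

print(variant_specific_kmers(reference, variant, 4))
```

A single substitution changes at most k overlapping k-mers, so the four variant-only 4-mers printed here all contain the substituted base; locating the position they share recovers the SNP without aligning the two sequences.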



Citations: 0

Cross-modality Matching and Prediction of Perturbation Responses with Labeled Gromov-Wasserstein Optimal Transport

arXiv - QuanBio - Genomics

Pub Date : 2024-05-01 DOI: arxiv-2405.00838

Jayoung Ryu, Romain Lopez, Charlotte Bunne, Aviv Regev

It is now possible to conduct large scale perturbation screens with complex readout modalities, such as different molecular profiles or high content cell images. While these open the way for systematic dissection of causal cell circuits, integrating such data across screens to maximize our ability to predict circuits poses substantial computational challenges, which have not been addressed. Here, we extend two Gromov-Wasserstein Optimal Transport methods to incorporate the perturbation label for cross-modality alignment. The obtained alignment is then employed to train a predictive model that estimates cellular responses to perturbations observed with only one measurement modality. We validate our method for the tasks of cross-modality alignment and cross-modality prediction in a recent multi-modal single-cell perturbation dataset. Our approach opens the way to unified causal models of cell biology.


Citations: 0

Heterogeneity analysis provides evidence for a genetically homogeneous subtype of bipolar-disorder

arXiv - QuanBio - Genomics

Pub Date : 2024-04-30 DOI: arxiv-2405.00159

Caroline C. McGrouther, Aaditya V. Rangan, Arianna Di Florio, Jeremy A. Elman, Nicholas J. Schork, John Kelsoe

Bipolar disorder is a highly heritable brain disorder which affects an estimated 50 million people worldwide. Due to recent advances in genotyping technology and bioinformatics methodology, as well as the increase in the overall amount of available data, our understanding of the genetic underpinnings of BD has improved. A growing consensus is that BD is polygenic and heterogeneous, but the specifics of that heterogeneity are not yet well understood. Here we use a recently developed technique to investigate the genetic heterogeneity of bipolar disorder. We find strong statistical evidence for a 'bicluster': a subset of bipolar subjects that exhibits a disease-specific genetic pattern. The structure illuminated by this bicluster replicates in several other data-sets and can be used to improve BD risk-prediction algorithms. We believe that this bicluster is likely to correspond to a genetically-distinct subtype of BD. More generally, we believe that our biclustering approach is a promising means of untangling the underlying heterogeneity of complex disease without the need for reliable subphenotypic data.



Citations: 0

Season combinatorial intervention predictions with Salt & Peper

arXiv - QuanBio - Genomics

Pub Date : 2024-04-25 DOI: arxiv-2404.16907

Thomas Gaudelet, Alice Del Vecchio, Eli M Carrami, Juliana Cudini, Chantriolnt-Andreas Kapourani, Caroline Uhler, Lindsay Edwards

Interventions play a pivotal role in the study of complex biological systems. In drug discovery, genetic interventions (such as CRISPR base editing) have become central to both identifying potential therapeutic targets and understanding a drug's mechanism of action. With the advancement of CRISPR and the proliferation of genome-scale analyses such as transcriptomics, a new challenge is to navigate the vast combinatorial space of concurrent genetic interventions. Addressing this, our work concentrates on estimating the effects of pairwise genetic combinations on the cellular transcriptome. We introduce two novel contributions: Salt, a biologically-inspired baseline that posits the mostly additive nature of combination effects, and Peper, a deep learning model that extends Salt's additive assumption to achieve unprecedented accuracy. Our comprehensive comparison against existing state-of-the-art methods, grounded in diverse metrics, and our out-of-distribution analysis highlight the limitations of current models in realistic settings. This analysis underscores the necessity for improved modelling techniques and data acquisition strategies, paving the way for more effective exploration of genetic intervention effects.
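The additive assumption behind a baseline like Salt can be illustrated with a short sketch. This is not the authors' implementation: the function names, the use of per-condition mean profiles, and the toy numbers are all assumptions made for illustration. It only shows the general idea of predicting a pairwise combination as the control profile plus the two single-intervention effects.

```python
from typing import Dict, List, Sequence


def mean_profile(cells: Sequence[Sequence[float]]) -> List[float]:
    """Average expression per gene across cells (rows = cells, columns = genes)."""
    n_cells = len(cells)
    return [sum(col) / n_cells for col in zip(*cells)]


def additive_prediction(
    control: Sequence[float],
    single_effects: Dict[str, Sequence[float]],
    gene_a: str,
    gene_b: str,
) -> List[float]:
    """Predict the double-intervention profile as control + effect(a) + effect(b)."""
    return [
        c + da + db
        for c, da, db in zip(control, single_effects[gene_a], single_effects[gene_b])
    ]


# Toy data, invented for illustration: 2 cells x 3 genes per condition.
control = mean_profile([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
perturbed = {
    "A": mean_profile([[2.0, 2.0, 3.0], [2.0, 2.0, 3.0]]),
    "B": mean_profile([[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]]),
}

# Single-intervention effect = perturbed mean minus control mean, per gene.
effects = {
    gene: [p - c for p, c in zip(profile, control)]
    for gene, profile in perturbed.items()
}

predicted_ab = additive_prediction(control, effects, "A", "B")
print(predicted_ab)
```

A model that extends this assumption, as the abstract describes for Peper, would presumably learn a correction on top of such an additive estimate; the abstract does not give those details, so they are not sketched here.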

Citations: 0