Leveraging gene correlations in single cell transcriptomic data

BMC Bioinformatics (IF 2.9; Zone 3, Biology; JCR Q2, Biochemical Research Methods) · Pub Date: 2024-09-18 · DOI: 10.1186/s12859-024-05926-z
Kai Silkwood, Emmanuel Dollinger, Joshua Gervin, Scott Atwood, Qing Nie, Arthur D. Lander

Abstract

Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data—looking for rare cell types, subtleties of cell states, and details of gene regulatory networks—there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data in which ground truth about biological variation is unknown (i.e., usually). We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization—a step that skews distributions, particularly for sparse data—and calculate p values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene–gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships. New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene–gene correlations.
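To make the idea of working from unnormalized counts against a Poisson sampling null more concrete, here is a minimal sketch in Python with NumPy. It is not the BigSur algorithm: it ignores transcriptional noise, computes no p values, and uses plain Pearson residuals as a stand-in statistic. The function names (`poisson_residuals`, `fano_like`, `residual_correlation`) and the simulated data are assumptions made only for illustration.

```python
import numpy as np


def poisson_residuals(counts):
    """Pearson residuals of raw counts under a pure Poisson sampling null.

    counts: (cells x genes) array of raw counts, with no normalization applied.
    The expected count for a cell/gene pair is taken to be
    (cell total * gene total) / grand total, an illustrative assumption.
    """
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    expected = cell_totals * gene_totals / counts.sum()
    if np.any(expected == 0):
        raise ValueError("every cell and every gene needs at least one count")
    return (counts - expected) / np.sqrt(expected)


def fano_like(residuals):
    """Mean squared residual per gene; close to 1 if only Poisson noise is present."""
    return (residuals ** 2).mean(axis=0)


def residual_correlation(residuals):
    """Gene-gene correlation computed directly on the residuals."""
    norms = np.sqrt((residuals ** 2).sum(axis=0))
    return (residuals.T @ residuals) / np.outer(norms, norms)


# Simulated data (hypothetical): 2000 cells, 20 genes, varying sequencing depth.
rng = np.random.default_rng(0)
n_cells, n_genes = 2000, 20
depth = rng.uniform(0.5, 1.5, size=n_cells)
gene_mean = rng.uniform(1.0, 5.0, size=n_genes)
gene_mean[:2] = 3.0
rate = depth[:, None] * gene_mean[None, :]

# Genes 0 and 1 share a latent factor, mimicking co-regulated heterogeneity.
latent = rng.gamma(shape=2.0, scale=0.5, size=n_cells)
rate[:, 0] *= latent
rate[:, 1] *= latent

counts = rng.poisson(rate)
residuals = poisson_residuals(counts)
fano = fano_like(residuals)
corr = residual_correlation(residuals)
```

In this toy setting, genes driven only by sampling noise should have a `fano` value near 1 and near-zero pairwise `corr`, while the two genes sharing the latent factor should stand out on both measures. Attaching calibrated p values to such statistics, which is the paper's contribution, requires the analytical treatment described in the abstract and is not attempted here.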