Samuel H Church, Jasmine L Mah, Günter Wagner, Casey W Dunn
Normalizing need not be the norm: count-based math for analyzing single-cell data.
Theory in Biosciences, pages 45-62. Published 2024-02-01 (Epub 2023-11-10). DOI: 10.1007/s12064-023-00408-x (https://doi.org/10.1007/s12064-023-00408-x)
Abstract
Counting transcripts of mRNA is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here, we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package countland, available in Python and R.
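As a rough illustration of the two ideas contrasted in the abstract, the sketch below computes depth-normalized proportions and, separately, dot products on raw integer counts. It assumes NumPy and a made-up toy count matrix; the function names `normalize_by_depth` and `pairwise_dot` are hypothetical and are not taken from the countland package, whose actual interface is not described here.

```python
import numpy as np

# Hypothetical toy count matrix: rows are cells, columns are genes.
# Cell 1 has the same transcript proportions as cell 0 at twice the depth.
counts = np.array([
    [10, 0, 3, 0],
    [20, 0, 6, 0],
    [0, 5, 0, 2],
], dtype=np.int64)


def normalize_by_depth(counts):
    """Divide each cell's counts by its total, giving proportional abundances."""
    depth = counts.sum(axis=1, keepdims=True)
    return counts / depth


def pairwise_dot(counts):
    """Dot product between every pair of cells, computed on raw integer counts."""
    return counts @ counts.T


proportions = normalize_by_depth(counts)  # floating-point values, no longer counts
dots = pairwise_dot(counts)               # still integers

print(proportions)
print(dots)
```

In this toy case the dot product between cells 0 and 2 is zero because they share no expressed genes, while cells 0 and 1 share a large dot product; the result stays in integers, whereas the normalized matrix does not.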
About the journal:
Theory in Biosciences focuses on new concepts in theoretical biology. It also includes analytical and modelling approaches as well as philosophical and historical issues. Central topics are:
Artificial Life;
Bioinformatics with a focus on novel methods, phenomena, and interpretations;
Bioinspired Modeling;
Complexity, Robustness, and Resilience;
Embodied Cognition;
Evolutionary Biology;
Evo-Devo;
Game Theoretic Modeling;
Genetics;
History of Biology;
Language Evolution;
Mathematical Biology;
Origin of Life;
Philosophy of Biology;
Population Biology;
Systems Biology;
Theoretical Ecology;
Theoretical Molecular Biology;
Theoretical Neuroscience & Cognition.