{"title":"A scalable approach to topic modelling in single-cell data by approximate pseudobulk projection.","authors":"Sishir Subedi, Tomokazu S Sumida, Yongjin P Park","doi":"10.26508/lsa.202402713","DOIUrl":null,"url":null,"abstract":"<p><p>Probabilistic topic modelling has become essential in many types of single-cell data analysis. Based on probabilistic topic assignments in each cell, we identify the latent representation of cellular states. A dictionary matrix, consisting of topic-specific gene frequency vectors, provides interpretable bases to be compared with known cell type-specific marker genes and other pathway annotations. However, fitting a topic model on a large number of cells would require heavy computational resources-specialized computing units, computing time and memory. Here, we present a scalable approximation method customized for single-cell RNA-seq data analysis, termed ASAP, short for Annotating a Single-cell data matrix by Approximate Pseudobulk estimation. Our approach is more accurate than existing methods but requires orders of magnitude less computing time, leaving much lower memory consumption. We also show that our approach is widely applicable for atlas-scale data analysis; our method seamlessly integrates single-cell and bulk data in joint analysis, not requiring additional preprocessing or feature selection steps.</p>","PeriodicalId":18081,"journal":{"name":"Life Science Alliance","volume":"7 10","pages":""},"PeriodicalIF":3.3000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11303850/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Life Science Alliance","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.26508/lsa.202402713","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/10/1 0:00:00","PubModel":"Print","JCR":"Q1","JCRName":"BIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Probabilistic topic modelling has become essential in many types of single-cell data analysis. Based on probabilistic topic assignments in each cell, we identify the latent representation of cellular states. A dictionary matrix, consisting of topic-specific gene frequency vectors, provides interpretable bases to be compared with known cell type-specific marker genes and other pathway annotations. However, fitting a topic model on a large number of cells would require heavy computational resources-specialized computing units, computing time and memory. Here, we present a scalable approximation method customized for single-cell RNA-seq data analysis, termed ASAP, short for Annotating a Single-cell data matrix by Approximate Pseudobulk estimation. Our approach is more accurate than existing methods but requires orders of magnitude less computing time, leaving much lower memory consumption. We also show that our approach is widely applicable for atlas-scale data analysis; our method seamlessly integrates single-cell and bulk data in joint analysis, not requiring additional preprocessing or feature selection steps.
期刊介绍:
Life Science Alliance is a global, open-access, editorially independent, and peer-reviewed journal launched by an alliance of EMBO Press, Rockefeller University Press, and Cold Spring Harbor Laboratory Press. Life Science Alliance is committed to rapid, fair, and transparent publication of valuable research from across all areas in the life sciences.