Oliver M Crook, Kathryn S Lilley, Laurent Gatto, Paul D W Kirk
Annals of Applied Statistics, volume 16, issue 4. Published 2022-12-01. Journal Article.
DOI: 10.1214/22-AOAS1603 (https://doi.org/10.1214/22-AOAS1603)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7613899/pdf/EMS143956.pdf
Citations: 0
Abstract
Semi-Supervised Non-Parametric Bayesian Modelling of Spatial Proteomics.
Understanding sub-cellular protein localisation is an essential component in the analysis of context-specific protein function. Recent advances in quantitative mass spectrometry (MS) have led to high-resolution mapping of thousands of proteins to sub-cellular locations within the cell. Novel modelling considerations are thus necessary to capture the complex nature of these data. We approach the analysis of spatial proteomics data in a non-parametric Bayesian framework, using K-component mixtures of Gaussian process regression models. The Gaussian process regression model accounts for correlation structure within a sub-cellular niche, with each mixture component capturing the distinct correlation structure observed within each niche. The availability of marker proteins (i.e. proteins with a priori known labelled locations) motivates a semi-supervised learning approach to inform the Gaussian process hyperparameters. We also provide an efficient Hamiltonian-within-Gibbs sampler for our model. Furthermore, we reduce the computational burden of inverting covariance matrices by exploiting their structure: a tensor decomposition of our covariance matrices allows extended Trench and Durbin algorithms to be applied, lowering the computational complexity of inversion and hence accelerating computation. We provide detailed case studies on Drosophila embryos and mouse pluripotent embryonic stem cells to illustrate the benefit of semi-supervised functional Bayesian modelling of the data.
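To make the remark about structured covariance matrices concrete, the following is a minimal sketch, not the authors' implementation. The Trench and Durbin algorithms operate on Toeplitz matrices, so the sketch assumes a stationary squared-exponential kernel evaluated on equally spaced inputs, which yields a symmetric Toeplitz covariance. It solves the linear system with a plain Levinson recursion (a close relative of the Durbin algorithm) in O(n^2) operations rather than the O(n^3) of a dense solve. The function names, kernel choice, and hyperparameter values are all hypothetical and chosen only for illustration.

```python
import math


def se_first_column(n, amplitude, length_scale, noise_var):
    """First column of an assumed squared-exponential covariance on a unit-spaced grid."""
    col = [amplitude ** 2 * math.exp(-0.5 * (lag / length_scale) ** 2) for lag in range(n)]
    col[0] += noise_var  # observation noise enters on the diagonal only
    return col


def solve_symmetric_toeplitz(r, b):
    """Solve T x = b, where T[i][j] = r[|i - j|], by Levinson recursion."""
    n = len(r)
    f = [1.0 / r[0]]   # forward vector: T_k f = first unit vector
    x = [b[0] / r[0]]  # solution of the leading k-by-k system
    for k in range(1, n):
        # Errors made by padding the current vectors with a trailing zero.
        eps_f = sum(r[k - i] * f[i] for i in range(k))
        eps_x = sum(r[k - i] * x[i] for i in range(k))
        denom = 1.0 - eps_f * eps_f
        f_ext = f + [0.0]
        b_ext = [0.0] + f[::-1]  # by symmetry the backward vector is f reversed
        f = [(fe - eps_f * be) / denom for fe, be in zip(f_ext, b_ext)]
        back = f[::-1]           # satisfies T_{k+1} back = last unit vector
        x = [xe + (b[k] - eps_x) * bk for xe, bk in zip(x + [0.0], back)]
    return x


def toeplitz_matvec(r, x):
    """Dense O(n^2) product T x, used here only to check the solution."""
    n = len(r)
    return [sum(r[abs(i - j)] * x[j] for j in range(n)) for i in range(n)]


# Hypothetical hyperparameters and a made-up profile across 8 fractions.
col = se_first_column(8, amplitude=1.0, length_scale=2.0, noise_var=0.1)
y = [0.3, -0.1, 0.8, 0.5, -0.4, 0.2, 0.0, 0.6]
alpha = solve_symmetric_toeplitz(col, y)
residual = max(abs(u - v) for u, v in zip(toeplitz_matvec(col, alpha), y))
```

The vector `alpha` plays the role of K^{-1} y in a Gaussian process likelihood, and only the first column of the covariance matrix is ever stored. How the paper's tensor decomposition extends this idea to its full covariance matrices is not described in the abstract and is not reproduced here.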
About the journal:
Statistical research spans an enormous range from direct subject-matter collaborations to pure mathematical theory. The Annals of Applied Statistics, the newest journal from the IMS, is aimed at papers in the applied half of this range. The journal is published quarterly in both print and electronic form, and our goal is to provide a timely and unified forum for all areas of applied statistics.