Bayesian Data Selection.

Journal of Machine Learning Research | Pub Date: 2023-01-01 | IF 5.2 | Zone 3 (Computer Science) | JCR Q1 (Automation & Control Systems)

Eli N Weinstein, Jeffrey W Miller

Journal of Machine Learning Research, volume 24 23. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10194814/pdf/

Citations: 0

Abstract



Bayesian Data Selection.

Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic (such as a subset of variables) that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, both statistically and computationally. We propose a novel score for performing data selection, the "Stein volume criterion (SVC)", that does not require fitting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. We prove that the SVC is consistent for data selection, and establish consistency and asymptotic normality of the corresponding generalized posterior on parameters. We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation.
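The abstract names the kernelized Stein discrepancy (KSD) as the ingredient that replaces the Kullback-Leibler divergence. The following is a minimal sketch of a standard U-statistic KSD estimate, not the SVC itself and not the authors' implementation. It assumes an RBF kernel with a fixed bandwidth and a standard normal model, neither of which is specified in the abstract; the names `ksd_u_statistic` and `gaussian_score` are hypothetical.

```python
import numpy as np

def ksd_u_statistic(x, score, bandwidth=1.0):
    """U-statistic estimate of the squared KSD between samples x and a model.

    x: array of shape (n, d); score: function returning grad log p(x), shape (n, d).
    Assumption: RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)).
    """
    n, d = x.shape
    s = score(x)                                  # model score at each sample
    diff = x[:, None, :] - x[None, :, :]          # diff[i, j] = x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)               # squared pairwise distances
    h2 = bandwidth ** 2
    k = np.exp(-sq / (2.0 * h2))

    term1 = (s @ s.T) * k                                    # s(x)^T s(y) k
    term2 = np.einsum("id,ijd->ij", s, diff) / h2 * k        # s(x)^T grad_y k
    term3 = -np.einsum("jd,ijd->ij", s, diff) / h2 * k       # s(y)^T grad_x k
    term4 = (d / h2 - sq / h2 ** 2) * k                      # trace(grad_x grad_y k)

    u = term1 + term2 + term3 + term4
    np.fill_diagonal(u, 0.0)                      # U-statistic excludes i == j
    return u.sum() / (n * (n - 1))

def gaussian_score(x):
    # Score of a standard normal model (assumed here for illustration only).
    return -x

rng = np.random.default_rng(0)
matched = rng.normal(size=(500, 2))               # data drawn from the model
shifted = rng.normal(loc=2.0, size=(500, 2))      # data the model fits poorly

ksd_matched = ksd_u_statistic(matched, gaussian_score)
ksd_shifted = ksd_u_statistic(shifted, gaussian_score)
print(ksd_matched, ksd_shifted)
```

In this toy setting the estimate should be close to zero for data drawn from the model and clearly larger for the shifted data. The estimate only requires the model's score function, so no density needs to be fit to the data.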


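Source journal

Journal of Machine Learning Research (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 18.80

Self-citation rate: 0.00%

Publication count: not listed

Review time: 3 months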

Journal description: The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing. JMLR seeks previously unpublished papers on machine learning that contain: new principled algorithms with sound empirical validation, and with justification of theoretical, psychological, or biological nature; experimental and/or theoretical studies yielding new insight into the design and behavior of learning in intelligent systems; accounts of applications of existing techniques that shed light on the strengths and weaknesses of the methods; formalization of new learning tasks (e.g., in the context of new applications) and of methods for assessing performance on those tasks; development of new analytical frameworks that advance theoretical studies of practical learning methods; computational models of data from natural learning systems at the behavioral or neural level; or extremely well-written surveys of existing work.