Journal of Machine Learning Research: Latest Articles, Page 5

A flexible model-free prediction-based framework for feature ranking.

IF 6 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2021-05-01

Jingyi Jessica Li, Yiling Elaine Chen, Xin Tong

Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention but is also rooted in scientists' strong interest in simple visualization and interpretability. As such, marginal feature ranking for some predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discovery. In this work, we focus on marginal ranking for binary classification, one of the most common predictive tasks. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample t test, and the two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions, both criteria achieve sample-level ranking that is consistent with their population-level counterparts with high probability. Moreover, NPC is robust to sampling bias when the two class proportions in a sample deviate from those in the population. This property gives NPC good potential in biomedical research, where sampling biases are ubiquitous. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.
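
Illustrative sketch (not the authors' CC or NPC estimator): the idea of ranking each feature by a prediction objective, rather than by a correlation or test statistic, can be shown with a much simpler stand-in. Here each feature is scored by the smallest training error of any single-threshold rule, and features are ranked by that error. The function names and the threshold-rule classifier are assumptions made for illustration only.

```python
def threshold_error(values, labels):
    """Smallest training error of a rule 'predict 1 if x > t' (or its mirror).

    `labels` must contain only 0 and 1. This is a simplified stand-in for a
    marginal classification-error criterion, not the paper's estimator.
    """
    n = len(values)
    pairs = sorted(zip(values, labels))
    positives = sum(labels)
    best = min(positives, n - positives)  # error of a constant prediction
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate tied values
        # Points left of the cut are predicted 0, points right are predicted 1.
        errors = sum(l for _, l in pairs[:i]) + sum(1 - l for _, l in pairs[i:])
        best = min(best, errors, n - errors)  # n - errors is the mirrored rule
    return best / n


def rank_features(columns, labels):
    """Return feature names sorted from most to least predictive, plus errors."""
    errors = {name: threshold_error(col, labels) for name, col in columns.items()}
    return sorted(errors, key=errors.get), errors


labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "c": [1, 2, 1, 2, 1, 2, 1, 2],   # uninformative
    "a": [1, 2, 3, 4, 5, 6, 7, 8],   # perfectly separating
    "b": [1, 5, 2, 6, 3, 7, 4, 8],   # partially informative
}
order, errors = rank_features(columns, labels)
```

A threshold rule is only one choice of marginal classifier; the abstract's point is that the ranking criterion should match the prediction objective and adapt to the feature distribution.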

Volume 22. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939838/pdf/

Citations: 0

Integrative High Dimensional Multiple Testing with Heterogeneity under Data Sharing Constraints.

IF 4.3 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2021-04-01

Molei Liu, Yin Xia, Kelly Cho, Tianxi Cai

Identifying informative predictors in a high dimensional regression model is a critical step for association analysis and predictive modeling. Signal detection in the high dimensional setting often fails due to the limited sample size. One approach to improving power is through meta-analyzing multiple studies which address the same scientific question. However, integrative analysis of high dimensional data from multiple studies is challenging in the presence of between-study heterogeneity. The challenge is even more pronounced with additional data sharing constraints under which only summary data can be shared across different sites. In this paper, we propose a novel data shielding integrative large-scale testing (DSILT) approach to signal detection allowing between-study heterogeneity and not requiring the sharing of individual level data. Assuming the underlying high dimensional regression models of the data differ across studies yet share similar support, the proposed method incorporates proper integrative estimation and debiasing procedures to construct test statistics for the overall effects of specific covariates. We also develop a multiple testing procedure to identify significant effects while controlling the false discovery rate (FDR) and false discovery proportion (FDP). Theoretical comparisons of the new testing procedure with the ideal individual-level meta-analysis (ILMA) approach and other distributed inference methods are investigated. Simulation studies demonstrate that the proposed testing procedure performs well in both controlling false discovery and attaining power. The new method is applied to a real example detecting interaction effects of the genetic variants for statins and obesity on the risk for type II diabetes.
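
Illustrative sketch (not the DSILT procedure): to make "controlling the false discovery rate" concrete, the classical Benjamini-Hochberg step-up procedure is shown below as a generic example of FDR-controlling multiple testing. It operates on a plain list of p-values; how DSILT builds its test statistics from summary data is not reproduced here, and the function name is an assumption.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure.

    Finds the largest rank k such that the k-th smallest p-value is at most
    alpha * k / m, then rejects the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            k = rank  # keep the largest rank that passes
    return sorted(order[:k])


rejected = benjamini_hochberg([0.212, 0.001, 0.039, 0.008, 0.6], alpha=0.05)
```

Because the procedure is step-up, a p-value that fails its own threshold can still be rejected when a larger p-value passes at a higher rank.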

Volume 22. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327421/pdf/

Citations: 0

Inference for Multiple Heterogeneous Networks with a Common Invariant Subspace.

IF 4.3 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2021-03-01

Jesús Arroyo, Avanti Athreya, Joshua Cape, Guodong Chen, Carey E Priebe, Joshua T Vogelstein

The development of models and methodology for the analysis of data from multiple heterogeneous networks is of importance both in statistical network theory and across a wide spectrum of application domains. Although single-graph analysis is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. This paper addresses exactly this gap, by introducing a new model, the common subspace independent-edge multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The model encompasses many popular network representations, including the stochastic blockmodel. The model is both flexible enough to meaningfully account for important graph differences, and tractable enough to allow for accurate inference in multiple networks. In particular, a joint spectral embedding of adjacency matrices (the multiple adjacency spectral embedding) leads to simultaneous consistent estimation of underlying parameters for each graph. Under mild additional assumptions, the estimates satisfy asymptotic normality and yield improvements for graph eigenvalue estimation. In both simulated and real data, the model and the embedding can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing, and community detection. Specifically, when the embedding is applied to a data set of connectomes constructed through diffusion magnetic resonance imaging, the result is an accurate classification of brain scans by human subject and a meaningful determination of heterogeneity across scans of different individuals.

Volume 22, no. 141, pp. 1-49. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8513708/pdf/

Citations: 0

Estimating Uncertainty Intervals from Collaborating Networks.

IF 4.3 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2021-01-01

Tianhui Zhou, Yitong Li, Yuan Wu, David Carlson

Effective decision making requires understanding the uncertainty inherent in a prediction. In regression, this uncertainty can be estimated by a variety of methods; however, many of these methods are laborious to tune, generate overconfident uncertainty intervals, or lack sharpness (give imprecise intervals). We address these challenges by proposing a novel method to capture predictive distributions in regression by defining two neural networks with two distinct loss functions. Specifically, one network approximates the cumulative distribution function, and the second network approximates its inverse. We refer to this method as Collaborating Networks (CN). Theoretical analysis demonstrates that a fixed point of the optimization is at the idealized solution, and that the method is asymptotically consistent with the ground-truth distribution. Empirically, learning is straightforward and robust. We benchmark CN against several common approaches on two synthetic and six real-world datasets, including forecasting A1c values in diabetic patients from electronic health records, where uncertainty is critical. In the synthetic data, the proposed approach essentially matches ground truth. In the real-world datasets, CN improves results on many performance metrics, including log-likelihood estimates, mean absolute errors, coverage estimates, and prediction interval widths.
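
Illustrative sketch (not the CN networks or their loss functions): the reason a CDF and its inverse are a useful pair is that the inverse gives interval endpoints directly and the CDF checks their coverage. The example below uses a known Gaussian from Python's standard library in place of the two trained networks; the helper names and the chosen mean and standard deviation are assumptions.

```python
from statistics import NormalDist


def central_interval(inverse_cdf, level=0.95):
    """Equal-tailed prediction interval read off an inverse CDF."""
    tail = (1.0 - level) / 2.0
    return inverse_cdf(tail), inverse_cdf(1.0 - tail)


def coverage(cdf, lower, upper):
    """Probability mass that a CDF assigns to the interval [lower, upper]."""
    return cdf(upper) - cdf(lower)


# Stand-in predictive distribution for one patient (hypothetical values).
predictive = NormalDist(mu=7.0, sigma=1.2)
lower, upper = central_interval(predictive.inv_cdf, level=0.95)
mass = coverage(predictive.cdf, lower, upper)
```

In CN the two functions are learned separately, so agreement between them (the inverse undoing the CDF) is something the training has to achieve rather than a property that holds by construction as it does here.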

Volume 22. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9231643/pdf/

Citations: 0

Bayesian Distance Clustering.

IF 4.3 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2021-01-01

Leo L Duan, David B Dunson

Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data.

Citations: 0

Adversarial Monte Carlo Meta-Learning of Optimal Prediction Procedures.

IF 6 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2021-01-01

Alex Luedtke, Incheoul Chung, Oleg Sofrygin

We frame the meta-learning of prediction procedures as a search for an optimal strategy in a two-player game. In this game, Nature selects a prior over distributions that generate labeled data consisting of features and an associated outcome, and the Predictor observes data sampled from a distribution drawn from this prior. The Predictor's objective is to learn a function that maps from a new feature to an estimate of the associated outcome. We establish that, under reasonable conditions, the Predictor has an optimal strategy that is equivariant to shifts and rescalings of the outcome and is invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. We introduce a neural network architecture that satisfies these properties. The proposed strategy performs favorably compared to standard practice in both parametric and nonparametric experiments.

Citations: 0

Empirical Bayes Matrix Factorization.

IF 6 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2021-01-01

Wei Wang, Matthew Stephens

Matrix factorization methods, which include Factor analysis (FA) and Principal Components Analysis (PCA), are widely used for inferring and summarizing structure in multivariate data. Many such methods use a penalty or prior distribution to achieve sparse representations ("Sparse FA/PCA"), and a key question is how much sparsity to induce. Here we introduce a general Empirical Bayes approach to matrix factorization (EBMF), whose key feature is that it estimates the appropriate amount of sparsity by estimating prior distributions from the observed data. The approach is very flexible: it allows for a wide range of different prior families and allows that each component of the matrix factorization may exhibit a different amount of sparsity. The key to this flexibility is the use of a variational approximation, which we show effectively reduces fitting the EBMF model to solving a simpler problem, the so-called "normal means" problem. We demonstrate the benefits of EBMF with sparse priors through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that agrees with known relationships among human tissues. Software implementing our approach is available at https://github.com/stephenslab/flashr.
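
Illustrative sketch (not the flashr implementation): the "normal means" problem mentioned in the abstract can be shown in its simplest form. Each observation is assumed to be a true mean plus Gaussian noise of known variance, and the true means are assumed to come from a N(0, prior_var) prior. The prior variance is estimated from the data by marginal maximum likelihood, and each observation is then shrunk toward zero by the resulting posterior-mean factor. The single zero-centred normal prior family and all names here are assumptions; the abstract describes support for a much wider range of prior families.

```python
def eb_normal_means(observations, noise_var=1.0):
    """Empirical Bayes posterior means under a N(0, prior_var) prior.

    Marginally each observation is N(0, prior_var + noise_var), so the
    maximum-likelihood estimate of prior_var is the mean square of the
    observations minus noise_var, truncated at zero.
    """
    n = len(observations)
    mean_square = sum(x * x for x in observations) / n
    prior_var = max(0.0, mean_square - noise_var)
    shrink = prior_var / (prior_var + noise_var)  # posterior-mean factor
    return prior_var, [shrink * x for x in observations]


# Large observations imply a diffuse prior and mild shrinkage.
prior_var, estimates = eb_normal_means([3.0, -3.0, 1.0, -1.0])

# Observations no larger than the noise level are shrunk all the way to zero.
null_var, null_estimates = eb_normal_means([0.5, -0.5])
```

The amount of shrinkage is not a tuning parameter here: it follows from the estimated prior, which is the property the abstract highlights.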

Volume 22. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10621241/pdf/

Citations: 0

Flexible Signal Denoising via Flexible Empirical Bayes Shrinkage.

IF 6 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2021-01-01

Zhengrong Xing, Peter Carbonetto, Matthew Stephens

Signal denoising, also known as non-parametric regression, is often performed through shrinkage estimation in a transformed (e.g., wavelet) domain; shrinkage in the transformed domain corresponds to smoothing in the original domain. A key question in such applications is how much to shrink, or, equivalently, how much to smooth. Empirical Bayes shrinkage methods provide an attractive solution to this problem; they use the data to estimate a distribution of underlying "effects," hence automatically select an appropriate amount of shrinkage. However, most existing implementations of empirical Bayes shrinkage are less flexible than they could be, both in their assumptions on the underlying distribution of effects and in their ability to handle heteroskedasticity, which limits their signal denoising applications. Here we address this by adopting a particularly flexible, stable and computationally convenient empirical Bayes shrinkage method and applying it to several signal denoising problems. These applications include smoothing of Poisson data and heteroskedastic Gaussian data. We show through empirical comparisons that the results are competitive with other methods, including both simple thresholding rules and purpose-built empirical Bayes procedures. Our methods are implemented in the R package smashr, "SMoothing by Adaptive SHrinkage in R," available at https://www.github.com/stephenslab/smashr.
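
Illustrative sketch (not the smashr method): the statement that shrinkage in the wavelet domain corresponds to smoothing in the original domain can be seen with a one-level Haar transform and a fixed soft threshold. The fixed threshold is exactly the kind of "simple thresholding rule" the abstract compares against; an empirical Bayes method would choose the amount of shrinkage from the data instead. All function names are assumptions, and the input length is assumed to be even.

```python
import math


def haar_forward(signal):
    """One-level orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, returning the signal in the original domain."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def soft_threshold(x, t):
    """Shrink x toward zero by t, setting it to zero when |x| <= t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def denoise(signal, t):
    """Shrink only the detail coefficients, then transform back."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, [soft_threshold(d, t) for d in detail])


noisy = [1.0, 1.2, 5.0, 4.8]
smoothed = denoise(noisy, t=1.0)    # small details removed: pairs are averaged
unchanged = denoise(noisy, t=0.0)   # no shrinkage: the signal is reproduced
```

With the small detail coefficients set to zero, neighbouring values are replaced by their pairwise average while the large jump between the two halves of the signal is preserved.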

Volume 22. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10751020/pdf/

Citations: 0

Nonparametric graphical model for counts.

IF 6 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2020-12-01

Arkaprava Roy, David B Dunson

Although multivariate count data are routinely collected in many application areas, there is surprisingly little work developing flexible models for characterizing their dependence structure. This is particularly true when interest focuses on inferring the conditional independence graph. In this article, we propose a new class of pairwise Markov random field-type models for the joint distribution of a multivariate count vector. By employing a novel type of transformation, we avoid restricting to non-negative dependence structures or inducing other restrictions through truncations. Taking a Bayesian approach to inference, we choose a Dirichlet process prior for the distribution of a random effect to induce great flexibility in the specification. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed for posterior computation. We prove various theoretical properties, including posterior consistency, and show that our COunt Nonparametric Graphical Analysis (CONGA) approach has good performance relative to competitors in simulation studies. The methods are motivated by an application to neuron spike count data in mice.

Citations: 0

Learning from Binary Multiway Data: Probabilistic Tensor Decomposition and its Statistical Optimality.

IF 6 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research

Pub Date : 2020-07-01

Miaoyan Wang, Lexin Li

We consider the problem of decomposing a higher-order tensor with binary entries. Such data problems arise frequently in applications such as neuroimaging, recommendation systems, topic modeling, and sensor network localization. We propose a multilinear Bernoulli model, develop a rank-constrained likelihood-based estimation method, and obtain theoretical accuracy guarantees. In contrast to continuous-valued problems, the binary tensor problem exhibits an interesting phase transition phenomenon according to the signal-to-noise ratio. The error bound for the parameter tensor estimation is established, and we show that the obtained rate is minimax optimal under the considered model. Furthermore, we develop an alternating optimization algorithm with convergence guarantees. The efficacy of our approach is demonstrated through both simulations and analyses of multiple data sets on the tasks of tensor completion and clustering.

Citations: 0