
Latest Publications from the Journal of Machine Learning Research

Inference for Multiple Heterogeneous Networks with a Common Invariant Subspace.
IF 4.3 | CAS Tier 3, Computer Science | Q1, Automation & Control Systems | Pub Date: 2021-03-01
Jesús Arroyo, Avanti Athreya, Joshua Cape, Guodong Chen, Carey E Priebe, Joshua T Vogelstein

The development of models and methodology for the analysis of data from multiple heterogeneous networks is of importance both in statistical network theory and across a wide spectrum of application domains. Although single-graph analysis is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. This paper addresses exactly this gap, by introducing a new model, the common subspace independent-edge multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The model encompasses many popular network representations, including the stochastic blockmodel. The model is both flexible enough to meaningfully account for important graph differences, and tractable enough to allow for accurate inference in multiple networks. In particular, a joint spectral embedding of adjacency matrices-the multiple adjacency spectral embedding-leads to simultaneous consistent estimation of underlying parameters for each graph. Under mild additional assumptions, the estimates satisfy asymptotic normality and yield improvements for graph eigenvalue estimation. In both simulated and real data, the model and the embedding can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing, and community detection. Specifically, when the embedding is applied to a data set of connectomes constructed through diffusion magnetic resonance imaging, the result is an accurate classification of brain scans by human subject and a meaningful determination of heterogeneity across scans of different individuals.
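As a rough illustration of the joint embedding step, the numpy sketch below embeds each graph separately, extracts a shared orthonormal basis from the concatenated embeddings, and projects each adjacency matrix onto that basis. The embedding dimension, the use of unscaled leading eigenvectors, and the toy two-block graphs are assumptions made for the sketch rather than the paper's exact specification.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Leading-d eigenvector basis of a symmetric adjacency matrix."""
    vals, vecs = np.linalg.eigh(A)
    keep = np.argsort(np.abs(vals))[::-1][:d]     # d largest eigenvalues in magnitude
    return vecs[:, keep]

def joint_embedding(adjacencies, d):
    """Shared basis V (n x d) and per-graph score matrices R_i = V' A_i V."""
    U = np.hstack([adjacency_spectral_embedding(A, d) for A in adjacencies])
    V = np.linalg.svd(U, full_matrices=False)[0][:, :d]   # estimated common subspace
    return V, [V.T @ A @ V for A in adjacencies]

# Toy usage: three graphs sharing the same two-block community structure but
# with different within-block connection probabilities.
rng = np.random.default_rng(0)
n, d = 60, 2
z = np.repeat([0, 1], n // 2)
graphs = []
for p_in in (0.6, 0.4, 0.5):
    P = np.where(z[:, None] == z[None, :], p_in, 0.1)
    A = np.triu(rng.binomial(1, P), 1)
    graphs.append(A + A.T)                         # symmetric, no self-loops
V, scores = joint_embedding(graphs, d)
print(V.shape, scores[0].shape)                    # (60, 2) (2, 2)
```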

Citations: 0
Bayesian time-aligned factor analysis of paired multivariate time series.
IF 6 | CAS Tier 3, Computer Science | Q1, Mathematics | Pub Date: 2021-01-01
Arkaprava Roy, Jana Schaich Borg, David B Dunson

Many modern data sets require inference methods that can estimate the shared and individual-specific components of variability in collections of matrices that change over time. Promising methods have been developed to analyze these types of data in static cases, but only a few approaches are available for dynamic settings. To address this gap, we consider novel models and inference methods for pairs of matrices in which the columns correspond to multivariate observations at different time points. In order to characterize common and individual features, we propose a Bayesian dynamic factor modeling framework called Time Aligned Common and Individual Factor Analysis (TACIFA) that includes uncertainty in time alignment through an unknown warping function. We provide theoretical support for the proposed model, showing identifiability and posterior concentration. The structure enables efficient computation through a Hamiltonian Monte Carlo (HMC) algorithm. We show excellent performance in simulations, and illustrate the method through application to a social mimicry experiment.
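To make the data structure concrete, the numpy sketch below simulates the kind of paired series the framework targets: two multivariate series share common latent factors, each also has an individual factor, and the second series observes the shared factors through a warped time grid. The warping function, factor dimensions, and noise level are illustrative assumptions, and the Bayesian inference itself (HMC over the warping and factors) is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p, k_common, k_indiv = 100, 6, 2, 1
t = np.linspace(0.0, 1.0, T)

# A monotone warping of the time axis for the second series (unknown in practice).
warp = t ** 1.3 / (t ** 1.3 + (1 - t) ** 1.3)

def smooth_factors(grid, k):
    """Smooth latent factor trajectories built from sinusoids with random phases."""
    return np.column_stack([np.sin(2 * np.pi * (j + 1) * grid + rng.normal())
                            for j in range(k)])

eta = smooth_factors(t, k_common)                       # shared factors on the common clock
Lambda1 = rng.normal(size=(p, k_common))
Lambda2 = rng.normal(size=(p, k_common))
indiv1 = smooth_factors(t, k_indiv) @ rng.normal(size=(k_indiv, p))
indiv2 = smooth_factors(t, k_indiv) @ rng.normal(size=(k_indiv, p))

Y1 = eta @ Lambda1.T + indiv1 + 0.1 * rng.normal(size=(T, p))
# The second series sees the shared factors through the warped clock.
eta_warped = np.column_stack([np.interp(warp, t, eta[:, j]) for j in range(k_common)])
Y2 = eta_warped @ Lambda2.T + indiv2 + 0.1 * rng.normal(size=(T, p))
print(Y1.shape, Y2.shape)                               # (100, 6) (100, 6)
```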

Citations: 0
Soft Tensor Regression.
IF 4.3 | CAS Tier 3, Computer Science | Q1, Automation & Control Systems | Pub Date: 2021-01-01
Georgia Papadogeorgou, Zhengwu Zhang, David B Dunson

Statistical methods relating tensor predictors to scalar outcomes in a regression model generally vectorize the tensor predictor and estimate the coefficients of its entries employing some form of regularization, use summaries of the tensor covariate, or use a low dimensional approximation of the coefficient tensor. However, low rank approximations of the coefficient tensor can suffer if the true rank is not small. We propose a tensor regression framework which assumes a soft version of the parallel factors (PARAFAC) approximation. In contrast to classic PARAFAC where each entry of the coefficient tensor is the sum of products of row-specific contributions across the tensor modes, the soft tensor regression (Softer) framework allows the row-specific contributions to vary around an overall mean. We follow a Bayesian approach to inference, and show that softening the PARAFAC increases model flexibility, leads to improved estimation of coefficient tensors, more accurate identification of important predictor entries, and more precise predictions, even for a low approximation rank. From a theoretical perspective, we show that employing Softer leads to a weakly consistent posterior distribution of the coefficient tensor, irrespective of the true or approximation tensor rank, a result that is not true when employing the classic PARAFAC for tensor regression. In the context of our motivating application, we adapt Softer to symmetric and semi-symmetric tensor predictors and analyze the relationship between brain network characteristics and human traits.
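The contrast between a classic PARAFAC coefficient tensor and a softened one can be made concrete with a small numpy sketch: in the softened version each entry's row-specific contributions deviate from shared means, so the coefficient tensor is only approximately low rank. The rank, tensor dimensions, and perturbation scale below are illustrative assumptions, not the paper's Bayesian specification.

```python
import numpy as np

rng = np.random.default_rng(2)
dims, rank, scale = (4, 5, 6), 2, 0.1

# Shared (mean) contributions: one loading matrix per tensor mode.
A, B, C = (rng.normal(size=(d, rank)) for d in dims)

# Classic PARAFAC: coef[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
coef_classic = np.einsum('ir,jr,kr->ijk', A, B, C)

# Softened version: each entry uses row contributions that deviate slightly
# from the shared means, so the result is no longer exactly low rank.
coef_soft = np.zeros(dims)
for i in range(dims[0]):
    for j in range(dims[1]):
        for k in range(dims[2]):
            a = A[i] + scale * rng.normal(size=rank)
            b = B[j] + scale * rng.normal(size=rank)
            c = C[k] + scale * rng.normal(size=rank)
            coef_soft[i, j, k] = np.sum(a * b * c)

print(np.linalg.matrix_rank(coef_classic.reshape(dims[0], -1)),   # exactly 2
      np.linalg.matrix_rank(coef_soft.reshape(dims[0], -1)))      # typically full (4)
```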

Citations: 0
Integrative Generalized Convex Clustering Optimization and Feature Selection for Mixed Multi-View Data.
IF 6 | CAS Tier 3, Computer Science | Q1, Mathematics | Pub Date: 2021-01-01
Minjie Wang, Genevera I Allen

In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
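A minimal numpy sketch of the kind of objective involved is shown below: each view contributes its own loss between observations and per-sample centroid rows, and a joint fusion penalty couples the centroid differences across views so that samples merge into common groups. The specific losses (squared and absolute error), uniform fusion weights, and penalty level are illustrative choices; the multi-block ADMM solver and the adaptive shifted group-lasso feature selection are not shown.

```python
import numpy as np
from itertools import combinations

def integrative_objective(views, centroids, lam):
    """views, centroids: lists of (n x p_v) arrays with matching sample order."""
    n = views[0].shape[0]
    # View-specific fidelity: squared error for the continuous view, absolute error for the count view.
    fit = np.sum((views[0] - centroids[0]) ** 2) + np.sum(np.abs(views[1] - centroids[1]))
    # Joint convex fusion penalty on stacked per-sample centroid differences.
    fuse = sum(np.linalg.norm(np.concatenate([U[i] - U[j] for U in centroids]))
               for i, j in combinations(range(n), 2))
    return fit + lam * fuse

rng = np.random.default_rng(3)
n = 20
X_gauss = rng.normal(size=(n, 5))                            # continuous view
X_count = rng.poisson(3.0, size=(n, 8)).astype(float)        # count-valued view
U = [X_gauss.copy(), X_count.copy()]                         # initialize centroids at the data
print(round(integrative_objective([X_gauss, X_count], U, lam=1.0), 2))
```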

Citations: 0
Estimating Uncertainty Intervals from Collaborating Networks.
IF 4.3 | CAS Tier 3, Computer Science | Q1, Automation & Control Systems | Pub Date: 2021-01-01
Tianhui Zhou, Yitong Li, Yuan Wu, David Carlson

Effective decision making requires understanding the uncertainty inherent in a prediction. In regression, this uncertainty can be estimated by a variety of methods; however, many of these methods are laborious to tune, generate overconfident uncertainty intervals, or lack sharpness (give imprecise intervals). We address these challenges by proposing a novel method to capture predictive distributions in regression by defining two neural networks with two distinct loss functions. Specifically, one network approximates the cumulative distribution function, and the second network approximates its inverse. We refer to this method as Collaborating Networks (CN). Theoretical analysis demonstrates that a fixed point of the optimization is at the idealized solution, and that the method is asymptotically consistent to the ground truth distribution. Empirically, learning is straightforward and robust. We benchmark CN against several common approaches on two synthetic and six real-world datasets, including forecasting A1c values in diabetic patients from electronic health records, where uncertainty is critical. In the synthetic data, the proposed approach essentially matches ground truth. In the real-world datasets, CN improves results on many performance metrics, including log-likelihood estimates, mean absolute errors, coverage estimates, and prediction interval widths.
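A compact PyTorch sketch of the two-network idea is given below: one network maps (x, y) to an estimate of the conditional CDF P(Y <= y | x), and a second maps (x, q) to the corresponding conditional quantile. The training losses used here (cross-entropy against threshold indicators and the pinball loss) are standard surrogates chosen for the sketch and differ from the paper's collaborating objectives; the architecture sizes and data are arbitrary.

```python
import torch
import torch.nn as nn

def mlp(in_dim):
    return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 1))

torch.manual_seed(0)
n, d = 512, 3
x = torch.randn(n, d)
y = (x @ torch.tensor([1.0, -0.5, 0.2]) + 0.5 * torch.randn(n)).unsqueeze(1)   # toy data

g = mlp(d + 1)   # g(x, y) -> logit of the conditional CDF at y
h = mlp(d + 1)   # h(x, q) -> estimated q-th conditional quantile
opt = torch.optim.Adam(list(g.parameters()) + list(h.parameters()), lr=1e-2)

for step in range(200):
    opt.zero_grad()
    # CDF network: classify indicators 1{y_i <= y'} at randomly drawn thresholds y'.
    y_probe = y[torch.randperm(n)]
    target = (y <= y_probe).float()
    loss_g = nn.functional.binary_cross_entropy_with_logits(
        g(torch.cat([x, y_probe], dim=1)), target)
    # Inverse-CDF network: pinball loss at randomly drawn quantile levels q.
    q = torch.rand(n, 1)
    resid = y - h(torch.cat([x, q], dim=1))
    loss_h = torch.mean(torch.maximum(q * resid, (q - 1) * resid))
    (loss_g + loss_h).backward()
    opt.step()

# A rough 90% predictive interval for a new point, read off the quantile network.
x_new = torch.zeros(1, d)
lo = h(torch.cat([x_new, torch.tensor([[0.05]])], dim=1)).item()
hi = h(torch.cat([x_new, torch.tensor([[0.95]])], dim=1)).item()
print(round(lo, 2), round(hi, 2))
```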

Citations: 0
Bayesian Distance Clustering.
IF 4.3 | CAS Tier 3, Computer Science | Q1, Automation & Control Systems | Pub Date: 2021-01-01
Leo L Duan, David B Dunson

Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging on properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data.
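A toy numpy sketch of the core idea is shown below: score a candidate partition through a likelihood placed on within-cluster pairwise distances rather than on the raw data. The exponential distance model with per-cluster rates estimated by maximum likelihood is an illustrative choice; the paper's priors, its treatment of between-cluster distances, and the posterior sampling over partitions are not reproduced.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def partition_log_score(X, labels):
    """Log-likelihood of within-cluster pairwise distances under an exponential model."""
    D = squareform(pdist(X))
    total = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        d = D[np.ix_(idx, idx)][np.triu_indices(len(idx), 1)]
        rate = 1.0 / np.mean(d)                      # exponential rate, fit by MLE
        total += np.sum(np.log(rate) - rate * d)
    return total

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(5, 1, size=(30, 2))])
true_labels = np.repeat([0, 1], 30)
shuffled = rng.permutation(true_labels)
# The well-separated partition scores higher than a random relabeling.
print(partition_log_score(X, true_labels) > partition_log_score(X, shuffled))   # True
```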

Citations: 0
Inference for the Case Probability in High-dimensional Logistic Regression.
IF 6 | CAS Tier 3, Computer Science | Q1, Mathematics | Pub Date: 2021-01-01
Zijian Guo, Prabrisha Rakshit, Daniel S Herman, Jinbo Chen

Labeling patients in electronic health records with respect to their statuses of having a disease or condition, i.e. case or control statuses, has increasingly relied on prediction models using high-dimensional variables derived from structured and unstructured electronic health record data. A major hurdle currently is a lack of valid statistical inference methods for the case probability. In this paper, considering high-dimensional sparse logistic regression models for prediction, we propose a novel bias-corrected estimator for the case probability through the development of linearization and variance enhancement techniques. We establish asymptotic normality of the proposed estimator for any loading vector in high dimensions. We construct a confidence interval for the case probability and propose a hypothesis testing procedure for patient case-control labelling. We demonstrate the proposed method via extensive simulation studies and application to real-world electronic health record data.
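For orientation, the sketch below carries out only the naive plug-in step that the paper improves on: fit an l1-penalized (sparse) logistic regression with p > n and read off the estimated case probability for a new loading vector. The regularization level, dimensions, and new-patient vector are arbitrary, and the paper's bias-corrected estimator, confidence interval, and testing procedure are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n, p, s = 200, 500, 5                      # high-dimensional: p > n, sparse truth
beta = np.zeros(p)
beta[:s] = 1.0
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))

# Sparse (lasso-penalized) logistic regression fit.
fit = LogisticRegression(penalty='l1', solver='liblinear', C=0.5).fit(X, y)

x_new = np.zeros(p)
x_new[:s] = 0.5                            # hypothetical loading vector for a new patient
plug_in = fit.predict_proba(x_new.reshape(1, -1))[0, 1]
truth = 1.0 / (1.0 + np.exp(-(x_new @ beta)))
print(round(plug_in, 3), round(truth, 3))  # the plug-in estimate is typically biased
```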

Citations: 0
Empirical Bayes Matrix Factorization.
IF 6 | CAS Tier 3, Computer Science | Q1, Mathematics | Pub Date: 2021-01-01
Wei Wang, Matthew Stephens

Matrix factorization methods, which include Factor analysis (FA) and Principal Components Analysis (PCA), are widely used for inferring and summarizing structure in multivariate data. Many such methods use a penalty or prior distribution to achieve sparse representations ("Sparse FA/PCA"), and a key question is how much sparsity to induce. Here we introduce a general Empirical Bayes approach to matrix factorization (EBMF), whose key feature is that it estimates the appropriate amount of sparsity by estimating prior distributions from the observed data. The approach is very flexible: it allows for a wide range of different prior families and allows that each component of the matrix factorization may exhibit a different amount of sparsity. The key to this flexibility is the use of a variational approximation, which we show effectively reduces fitting the EBMF model to solving a simpler problem, the so-called "normal means" problem. We demonstrate the benefits of EBMF with sparse priors through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that agrees with known relationships among human tissues. Software implementing our approach is available at https://github.com/stephenslab/flashr.
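The normal-means reduction the abstract refers to can be illustrated with a bare-bones rank-1 sketch in numpy: alternate between loadings and factor updates, and in each update shrink the least-squares estimates with an empirical Bayes rule under a zero-mean normal prior whose variance is estimated from the data. The flashr software fits multiple factors, supports richer prior families, and uses a variational approximation; none of that is reproduced in this simplified sketch.

```python
import numpy as np

def eb_normal_means(x, se2):
    """Posterior means for x_i ~ N(theta_i, se2), theta_i ~ N(0, tau2), tau2 by moments."""
    tau2 = max(np.mean(x ** 2) - se2, 0.0)
    return x * tau2 / (tau2 + se2)

def rank1_ebmf(Y, n_iter=50, sigma2=1.0):
    f = np.linalg.svd(Y, full_matrices=False)[2][0]        # initialize factor from the SVD
    l = np.zeros(Y.shape[0])
    for _ in range(n_iter):
        # Loadings update: least squares given f, then empirical Bayes shrinkage.
        l = eb_normal_means(Y @ f / np.sum(f ** 2), sigma2 / np.sum(f ** 2))
        if not np.any(l):
            break                                          # the factor was shrunk away entirely
        # Factor update: the symmetric step given the current loadings.
        f = eb_normal_means(Y.T @ l / np.sum(l ** 2), sigma2 / np.sum(l ** 2))
    return l, f

rng = np.random.default_rng(6)
n, p = 80, 40
l_true = np.concatenate([rng.normal(0, 2, 10), np.zeros(n - 10)])   # sparse loadings
f_true = rng.normal(size=p)
Y = np.outer(l_true, f_true) + rng.normal(size=(n, p))
l_hat, _ = rank1_ebmf(Y)
print(round(abs(np.corrcoef(l_hat, l_true)[0, 1]), 3))              # close to 1 (up to sign)
```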

Citations: 0
Adversarial Monte Carlo Meta-Learning of Optimal Prediction Procedures.
IF 6 | CAS Tier 3, Computer Science | Q1, Mathematics | Pub Date: 2021-01-01
Alex Luedtke, Incheoul Chung, Oleg Sofrygin

We frame the meta-learning of prediction procedures as a search for an optimal strategy in a two-player game. In this game, Nature selects a prior over distributions that generate labeled data consisting of features and an associated outcome, and the Predictor observes data sampled from a distribution drawn from this prior. The Predictor's objective is to learn a function that maps from a new feature to an estimate of the associated outcome. We establish that, under reasonable conditions, the Predictor has an optimal strategy that is equivariant to shifts and rescalings of the outcome and is invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. We introduce a neural network architecture that satisfies these properties. The proposed strategy performs favorably compared to standard practice in both parametric and nonparametric experiments.
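The symmetry properties the abstract emphasizes can be checked numerically with a small numpy stand-in: the procedure below standardizes the observed outcomes (so it is equivariant to shifts and rescalings of y) and summarizes the training pairs by mean pooling (so it is invariant to permuting the observations). The fixed random-weight "network" is only a placeholder; the adversarial meta-learning of the procedure itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(7)
W1 = rng.normal(size=(3, 8))      # arbitrary fixed weights; a trained network would go here
W2 = rng.normal(size=(8 + 2, 1))

def predict(X_train, y_train, x_new):
    mu, sd = y_train.mean(), y_train.std()
    y_std = (y_train - mu) / sd                              # outcome standardization
    pairs = np.column_stack([X_train, y_std])                # (n, d + 1) with d = 2 features
    summary = np.tanh(pairs @ W1).mean(axis=0)               # permutation-invariant pooling
    out_std = np.concatenate([summary, x_new]) @ W2          # prediction on the standardized scale
    return out_std.item() * sd + mu                          # undo the standardization

X, y = rng.normal(size=(50, 2)), rng.normal(size=50)
x_new = np.array([0.3, -1.2])
perm = rng.permutation(50)
p0 = predict(X, y, x_new)
print(np.isclose(p0, predict(X[perm], y[perm], x_new)))      # permutation invariance: True
print(np.isclose(3 * p0 + 7, predict(X, 3 * y + 7, x_new)))  # shift/rescale equivariance: True
```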

Citations: 0
Estimation and Optimization of Composite Outcomes.
IF 4.3 | CAS Tier 3, Computer Science | Q1, Automation & Control Systems | Pub Date: 2021-01-01
Daniel J Luckett, Eric B Laber, Siyeon Kim, Michael R Kosorok

There is tremendous interest in precision medicine as a means to improve patient outcomes by tailoring treatment to individual characteristics. An individualized treatment rule formalizes precision medicine as a map from patient information to a recommended treatment. A treatment rule is defined to be optimal if it maximizes the mean of a scalar outcome in a population of interest, e.g., symptom reduction. However, clinical and intervention scientists often seek to balance multiple and possibly competing outcomes, e.g., symptom reduction and the risk of an adverse event. One approach to precision medicine in this setting is to elicit a composite outcome which balances all competing outcomes; unfortunately, eliciting a composite outcome directly from patients is difficult without a high-quality instrument, and an expert-derived composite outcome may not account for heterogeneity in patient preferences. We propose a new paradigm for the study of precision medicine using observational data that relies solely on the assumption that clinicians are approximately (i.e., imperfectly) making decisions to maximize individual patient utility. Estimated composite outcomes are subsequently used to construct an estimator of an individualized treatment rule which maximizes the mean of patient-specific composite outcomes. The estimated composite outcomes and estimated optimal individualized treatment rule provide new insights into patient preference heterogeneity, clinician behavior, and the value of precision medicine in a given domain. We derive inference procedures for the proposed estimators under mild conditions and demonstrate their finite sample performance through a suite of simulation experiments and an illustrative application to data from a study of bipolar depression.
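The downstream optimization step can be sketched simply: once a composite outcome is in hand, fit an outcome model within each treatment arm and recommend, for each patient, the arm with the larger predicted composite outcome. In the sketch below the utility weight combining the two outcomes is fixed by assumption; the paper's key contribution, inferring patient-specific composite outcomes from approximately utility-maximizing clinician decisions, is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
n = 1000
X = rng.normal(size=(n, 2))                                  # patient covariates
A = rng.binomial(1, 0.5, size=n)                             # randomized treatment assignment
symptom_relief = 1.0 + (2 * A - 1) * X[:, 0] + rng.normal(scale=0.5, size=n)
adverse_risk = 0.5 * A + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n)

w = 0.7                                                      # assumed utility weight on symptom relief
composite = w * symptom_relief - (1 - w) * adverse_risk

# One outcome model per arm; the estimated rule picks the arm with the higher prediction.
m0 = LinearRegression().fit(X[A == 0], composite[A == 0])
m1 = LinearRegression().fit(X[A == 1], composite[A == 1])
recommend = (m1.predict(X) > m0.predict(X)).astype(int)
print(round(recommend.mean(), 2))                            # share of patients recommended treatment 1
```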

Citations: 0