Pub Date : 2017-08-28DOI: 10.13140/RG.2.2.27112.37120
Z. Szabó, Bharath K. Sriperumbudur
Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics and Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable on a wide variety of domains. Despite their tremendous success, quite little is known about when HSIC characterizes independence and when MMD with tensor product kernel can discriminate probability distributions. In this paper, we answer these questions by studying various notions of characteristic property of the tensor product kernel.
{"title":"Characteristic and Universal Tensor Product Kernels","authors":"Z. Szabó, Bharath K. Sriperumbudur","doi":"10.13140/RG.2.2.27112.37120","DOIUrl":"https://doi.org/10.13140/RG.2.2.27112.37120","url":null,"abstract":"Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics and Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable on a wide variety of domains. Despite their tremendous success, quite little is known about when HSIC characterizes independence and when MMD with tensor product kernel can discriminate probability distributions. In this paper, we answer these questions by studying various notions of characteristic property of the tensor product kernel.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"32 1","pages":"233:1-233:29"},"PeriodicalIF":0.0,"publicationDate":"2017-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79714976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-04-27DOI: 10.4230/LIPIcs.ITCS.2018.5
Maria-Florina Balcan, Yingyu Liang, David P. Woodruff, Hongyang Zhang
This work studies the strong duality of non-convex matrix factorization problems: we show that under certain dual conditions, these problems and its dual have the same optimum. This has been well understood for convex optimization, but little was known for non-convex problems. We propose a novel analytical framework and show that under certain dual conditions, the optimal solution of the matrix factorization program is the same as its bi-dual and thus the global optimality of the non-convex program can be achieved by solving its bi-dual which is convex. These dual conditions are satisfied by a wide class of matrix factorization problems, although matrix factorization problems are hard to solve in full generality. This analytical framework may be of independent interest to non-convex optimization more broadly. We apply our framework to two prototypical matrix factorization problems: matrix completion and robust Principal Component Analysis (PCA). These are examples of efficiently recovering a hidden matrix given limited reliable observations of it. Our framework shows that exact recoverability and strong duality hold with nearly-optimal sample complexity guarantees for matrix completion and robust PCA.
{"title":"Matrix Completion and Related Problems via Strong Duality","authors":"Maria-Florina Balcan, Yingyu Liang, David P. Woodruff, Hongyang Zhang","doi":"10.4230/LIPIcs.ITCS.2018.5","DOIUrl":"https://doi.org/10.4230/LIPIcs.ITCS.2018.5","url":null,"abstract":"This work studies the strong duality of non-convex matrix factorization problems: we show that under certain dual conditions, these problems and its dual have the same optimum. This has been well understood for convex optimization, but little was known for non-convex problems. We propose a novel analytical framework and show that under certain dual conditions, the optimal solution of the matrix factorization program is the same as its bi-dual and thus the global optimality of the non-convex program can be achieved by solving its bi-dual which is convex. These dual conditions are satisfied by a wide class of matrix factorization problems, although matrix factorization problems are hard to solve in full generality. This analytical framework may be of independent interest to non-convex optimization more broadly. \u0000 \u0000We apply our framework to two prototypical matrix factorization problems: matrix completion and robust Principal Component Analysis (PCA). These are examples of efficiently recovering a hidden matrix given limited reliable observations of it. Our framework shows that exact recoverability and strong duality hold with nearly-optimal sample complexity guarantees for matrix completion and robust PCA.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"35 1","pages":"102:1-102:56"},"PeriodicalIF":0.0,"publicationDate":"2017-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91541053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-03-23DOI: 10.3929/ETHZ-B-000269194
Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman
How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables significant reduction in training-time with negligible approximation error.
{"title":"Training Gaussian Mixture Models at Scale via Coresets","authors":"Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman","doi":"10.3929/ETHZ-B-000269194","DOIUrl":"https://doi.org/10.3929/ETHZ-B-000269194","url":null,"abstract":"How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables significant reduction in training-time with negligible approximation error.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"13 1","pages":"160:1-160:25"},"PeriodicalIF":0.0,"publicationDate":"2017-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89747655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we propose a modified version of the simulated annealing algorithm for solving a stochastic global optimization problem. More precisely, we address the problem of finding a global minimizer of a function with noisy evaluations. We provide a rate of convergence and its optimized parametrization to ensure a minimal number of evaluations for a given accuracy and a confidence level close to 1. This work is completed with a set of numerical experimentations and assesses the practical performance both on benchmark test cases and on real world examples.
{"title":"Convergence Rate of a Simulated Annealing Algorithm with Noisy Observations","authors":"Clément Bouttier, Ioana Gavra","doi":"10.5555/3322706.3322710","DOIUrl":"https://doi.org/10.5555/3322706.3322710","url":null,"abstract":"In this paper we propose a modified version of the simulated annealing algorithm for solving a stochastic global optimization problem. More precisely, we address the problem of finding a global minimizer of a function with noisy evaluations. We provide a rate of convergence and its optimized parametrization to ensure a minimal number of evaluations for a given accuracy and a confidence level close to 1. This work is completed with a set of numerical experimentations and assesses the practical performance both on benchmark test cases and on real world examples.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"38 1","pages":"4:1-4:45"},"PeriodicalIF":0.0,"publicationDate":"2017-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75707618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the years, different characterizations of the curse of dimensionality have been provided, usually stating the conditions under which, in the limit of the infinite dimensionality, distances become indistinguishable. However, these characterizations almost never address the form of associated distributions in the finite, although high-dimensional, case. This work aims to contribute in this respect by investigating the distribution of distances, and of direct and reverse nearest neighbors, in intrinsically high-dimensional spaces. Indeed, we derive a closed form for the distribution of distances from a given point, for the expected distance from a given point to its kth nearest neighbor, and for the expected size of the approximate set of neighbors of a given point in finite high-dimensional spaces. Additionally, the hubness problem is considered, which is related to the form of the function Nk representing the number of points that have a given point as one of their k nearest neighbors, which is also called the number of k-occurrences. Despite the extensive use of this function, the precise characterization of its form is a longstanding problem. We derive a closed form for the number of k-occurrences associated with a given point in finite high-dimensional spaces, together with the associated limiting probability distribution. By investigating the relationships with the hubness phenomenon emerging in network science, we find that the distribution of node (in-)degrees of some real-life, large-scale networks has connections with the distribution of k-occurrences described herein.
{"title":"On the Behavior of Intrinsically High-Dimensional Spaces: Distances, Direct and Reverse Nearest Neighbors, and Hubness","authors":"F. Angiulli","doi":"10.5555/3122009.3242027","DOIUrl":"https://doi.org/10.5555/3122009.3242027","url":null,"abstract":"Over the years, different characterizations of the curse of dimensionality have been provided, usually stating the conditions under which, in the limit of the infinite dimensionality, distances become indistinguishable. However, these characterizations almost never address the form of associated distributions in the finite, although high-dimensional, case. This work aims to contribute in this respect by investigating the distribution of distances, and of direct and reverse nearest neighbors, in intrinsically high-dimensional spaces. Indeed, we derive a closed form for the distribution of distances from a given point, for the expected distance from a given point to its kth nearest neighbor, and for the expected size of the approximate set of neighbors of a given point in finite high-dimensional spaces. Additionally, the hubness problem is considered, which is related to the form of the function Nk representing the number of points that have a given point as one of their k nearest neighbors, which is also called the number of k-occurrences. Despite the extensive use of this function, the precise characterization of its form is a longstanding problem. We derive a closed form for the number of k-occurrences associated with a given point in finite high-dimensional spaces, together with the associated limiting probability distribution. By investigating the relationships with the hubness phenomenon emerging in network science, we find that the distribution of node (in-)degrees of some real-life, large-scale networks has connections with the distribution of k-occurrences described herein.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"79 1","pages":"170:1-170:60"},"PeriodicalIF":0.0,"publicationDate":"2017-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90421272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-07-04DOI: 10.4230/LIPIcs.CCC.2017.25
Srinivasan Arunachalam, R. D. Wolf
$ newcommand{eps}{varepsilon} $In learning theory, the VC dimension of a concept class $C$ is the most common way to measure its "richness." In the PAC model $$ ThetaBig(frac{d}{eps} + frac{log(1/delta)}{eps}Big) $$ examples are necessary and sufficient for a learner to output, with probability $1-delta$, a hypothesis $h$ that is $eps$-close to the target concept $c$. In the related agnostic model, where the samples need not come from a $cin C$, we know that $$ ThetaBig(frac{d}{eps^2} + frac{log(1/delta)}{eps^2}Big) $$ examples are necessary and sufficient to output an hypothesis $hin C$ whose error is at most $eps$ worse than the best concept in $C$. Here we analyze quantum sample complexity, where each example is a coherent quantum state. This model was introduced by Bshouty and Jackson, who showed that quantum examples are more powerful than classical examples in some fixed-distribution settings. However, Atici and Servedio, improved by Zhang, showed that in the PAC setting, quantum examples cannot be much more powerful: the required number of quantum examples is $$ OmegaBig(frac{d^{1-eta}}{eps} + d + frac{log(1/delta)}{eps}Big)mbox{ for all }eta> 0. $$ Our main result is that quantum and classical sample complexity are in fact equal up to constant factors in both the PAC and agnostic models. We give two approaches. The first is a fairly simple information-theoretic argument that yields the above two classical bounds and yields the same bounds for quantum sample complexity up to a $log(d/eps)$ factor. We then give a second approach that avoids the log-factor loss, based on analyzing the behavior of the "Pretty Good Measurement" on the quantum state identification problems that correspond to learning. This shows classical and quantum sample complexity are equal up to constant factors.
$ newcommand{eps}{varepsilon} $在学习理论中,概念类$C$的VC维是衡量其“丰富度”的最常用方法。在PAC模型中$$ ThetaBig(frac{d}{eps} + frac{log(1/delta)}{eps}Big) $$示例对于学习者输出来说是必要和充分的,其概率为$1-delta$,假设$h$是$eps$ -接近目标概念$c$。在相关的不可知论模型中,样本不需要来自$cin C$,我们知道$$ ThetaBig(frac{d}{eps^2} + frac{log(1/delta)}{eps^2}Big) $$示例是必要的,足以输出一个假设$hin C$,其误差最多$eps$比$C$中的最佳概念差。这里我们分析量子样本复杂度,其中每个例子都是相干量子态。这个模型是由Bshouty和Jackson提出的,他们表明,在一些固定分布的环境中,量子例子比经典例子更强大。然而,由Zhang改进的Atici和Servedio表明,在PAC设置中,量子样例不能更强大:所需的量子样例数量为$$ OmegaBig(frac{d^{1-eta}}{eps} + d + frac{log(1/delta)}{eps}Big)mbox{ for all }eta> 0. $$。我们的主要结果是,在PAC和不可知论模型中,量子和经典样本复杂性实际上等于常数因子。我们给出了两种方法。第一个是一个相当简单的信息论论证,它产生了上述两个经典边界,并产生了量子样本复杂性的相同边界,直至$log(d/eps)$因子。然后,我们给出了第二种避免对数因子损失的方法,基于分析“相当好的测量”在与学习对应的量子态识别问题上的行为。这表明经典和量子样本的复杂性等于常数因子。
{"title":"Optimal Quantum Sample Complexity of Learning Algorithms","authors":"Srinivasan Arunachalam, R. D. Wolf","doi":"10.4230/LIPIcs.CCC.2017.25","DOIUrl":"https://doi.org/10.4230/LIPIcs.CCC.2017.25","url":null,"abstract":"$ newcommand{eps}{varepsilon} $In learning theory, the VC dimension of a concept class $C$ is the most common way to measure its \"richness.\" In the PAC model $$ ThetaBig(frac{d}{eps} + frac{log(1/delta)}{eps}Big) $$ examples are necessary and sufficient for a learner to output, with probability $1-delta$, a hypothesis $h$ that is $eps$-close to the target concept $c$. In the related agnostic model, where the samples need not come from a $cin C$, we know that $$ ThetaBig(frac{d}{eps^2} + frac{log(1/delta)}{eps^2}Big) $$ examples are necessary and sufficient to output an hypothesis $hin C$ whose error is at most $eps$ worse than the best concept in $C$. \u0000Here we analyze quantum sample complexity, where each example is a coherent quantum state. This model was introduced by Bshouty and Jackson, who showed that quantum examples are more powerful than classical examples in some fixed-distribution settings. However, Atici and Servedio, improved by Zhang, showed that in the PAC setting, quantum examples cannot be much more powerful: the required number of quantum examples is $$ OmegaBig(frac{d^{1-eta}}{eps} + d + frac{log(1/delta)}{eps}Big)mbox{ for all }eta> 0. $$ Our main result is that quantum and classical sample complexity are in fact equal up to constant factors in both the PAC and agnostic models. We give two approaches. The first is a fairly simple information-theoretic argument that yields the above two classical bounds and yields the same bounds for quantum sample complexity up to a $log(d/eps)$ factor. We then give a second approach that avoids the log-factor loss, based on analyzing the behavior of the \"Pretty Good Measurement\" on the quantum state identification problems that correspond to learning. This shows classical and quantum sample complexity are equal up to constant factors.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"6 1","pages":"71:1-71:36"},"PeriodicalIF":0.0,"publicationDate":"2016-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78283883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-06-22DOI: 10.3929/ETHZ-B-000278021
Emilija Perkovic, J. Textor, M. Kalisch, M. Maathuis
We present a graphical criterion for covariate adjustment that is sound and complete for four different classes of causal graphical models: directed acyclic graphs (DAGs), maximum ancestral graphs (MAGs), completed partially directed acyclic graphs (CPDAGs), and partial ancestral graphs (PAGs). Our criterion unifies covariate adjustment for a large set of graph classes. Moreover, we define an explicit set that satisfies our criterion, if there is any set that satisfies our criterion. We also give efficient algorithms for constructing all (minimal) sets that fulfill our criterion, implemented in the R package dagitty. Finally, we discuss the relationship between our criterion and other criteria for adjustment, and we provide a new, elementary soundness proof for the adjustment criterion for DAGs of Shpitser, VanderWeele and Robins (UAI 2010).
{"title":"Complete Graphical Characterization and Construction of Adjustment Sets in Markov Equivalence Classes of Ancestral Graphs","authors":"Emilija Perkovic, J. Textor, M. Kalisch, M. Maathuis","doi":"10.3929/ETHZ-B-000278021","DOIUrl":"https://doi.org/10.3929/ETHZ-B-000278021","url":null,"abstract":"We present a graphical criterion for covariate adjustment that is sound and complete for four different classes of causal graphical models: directed acyclic graphs (DAGs), maximum ancestral graphs (MAGs), completed partially directed acyclic graphs (CPDAGs), and partial ancestral graphs (PAGs). Our criterion unifies covariate adjustment for a large set of graph classes. Moreover, we define an explicit set that satisfies our criterion, if there is any set that satisfies our criterion. We also give efficient algorithms for constructing all (minimal) sets that fulfill our criterion, implemented in the R package dagitty. Finally, we discuss the relationship between our criterion and other criteria for adjustment, and we provide a new, elementary soundness proof for the adjustment criterion for DAGs of Shpitser, VanderWeele and Robins (UAI 2010).","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"91 1","pages":"220:1-220:62"},"PeriodicalIF":0.0,"publicationDate":"2016-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82371002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gaussian processes (GPs) are flexible distributions over functions that enable high-level assumptions about unknown functions to be encoded in a parsimonious, flexible and general way. Although elegant, the application of GPs is limited by computational and analytical intractabilities that arise when data are sufficiently numerous or when employing non-Gaussian models. Consequently, a wealth of GP approximation schemes have been developed over the last 15 years to address these key limitations. Many of these schemes employ a small set of pseudo data points to summarise the actual data. In this paper, we develop a new pseudo-point approximation framework using Power Expectation Propagation (Power EP) that unifies a large number of these pseudo-point approximations. Unlike much of the previous venerable work in this area, the new framework is built on standard methods for approximate inference (variational free-energy, EP and Power EP methods) rather than employing approximations to the probabilistic generative model itself. In this way, all of approximation is performed at `inference time' rather than at `modelling time' resolving awkward philosophical and empirical questions that trouble previous approaches. Crucially, we demonstrate that the new framework includes new pseudo-point approximation methods that outperform current approaches on regression and classification tasks.
{"title":"A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation","authors":"T. Bui, Josiah Yan, Richard E. Turner","doi":"10.17863/CAM.20846","DOIUrl":"https://doi.org/10.17863/CAM.20846","url":null,"abstract":"Gaussian processes (GPs) are flexible distributions over functions that enable high-level assumptions about unknown functions to be encoded in a parsimonious, flexible and general way. Although elegant, the application of GPs is limited by computational and analytical intractabilities that arise when data are sufficiently numerous or when employing non-Gaussian models. Consequently, a wealth of GP approximation schemes have been developed over the last 15 years to address these key limitations. Many of these schemes employ a small set of pseudo data points to summarise the actual data. In this paper, we develop a new pseudo-point approximation framework using Power Expectation Propagation (Power EP) that unifies a large number of these pseudo-point approximations. Unlike much of the previous venerable work in this area, the new framework is built on standard methods for approximate inference (variational free-energy, EP and Power EP methods) rather than employing approximations to the probabilistic generative model itself. In this way, all of approximation is performed at `inference time' rather than at `modelling time' resolving awkward philosophical and empirical questions that trouble previous approaches. Crucially, we demonstrate that the new framework includes new pseudo-point approximation methods that outperform current approaches on regression and classification tasks.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"12 1","pages":"104:1-104:72"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90442053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-02-13DOI: 10.15496/PUBLIKATION-30501
I. Tolstikhin, Bharath K. Sriperumbudur, Krikamol Muandet
In this paper, we study the minimax estimation of the Bochner integral $$mu_k(P):=int_{mathcal{X}} k(cdot,x),dP(x),$$ also called as the kernel mean embedding, based on random samples drawn i.i.d.~from $P$, where $k:mathcal{X}timesmathcal{X}rightarrowmathbb{R}$ is a positive definite kernel. Various estimators (including the empirical estimator), $hat{theta}_n$ of $mu_k(P)$ are studied in the literature wherein all of them satisfy $bigl| hat{theta}_n-mu_k(P)bigr|_{mathcal{H}_k}=O_P(n^{-1/2})$ with $mathcal{H}_k$ being the reproducing kernel Hilbert space induced by $k$. The main contribution of the paper is in showing that the above mentioned rate of $n^{-1/2}$ is minimax in $|cdot|_{mathcal{H}_k}$ and $|cdot|_{L^2(mathbb{R}^d)}$-norms over the class of discrete measures and the class of measures that has an infinitely differentiable density, with $k$ being a continuous translation-invariant kernel on $mathbb{R}^d$. The interesting aspect of this result is that the minimax rate is independent of the smoothness of the kernel and the density of $P$ (if it exists). This result has practical consequences in statistical applications as the mean embedding has been widely employed in non-parametric hypothesis testing, density estimation, causal inference and feature selection, through its relation to energy distance (and distance covariance).
{"title":"Minimax Estimation of Kernel Mean Embeddings","authors":"I. Tolstikhin, Bharath K. Sriperumbudur, Krikamol Muandet","doi":"10.15496/PUBLIKATION-30501","DOIUrl":"https://doi.org/10.15496/PUBLIKATION-30501","url":null,"abstract":"In this paper, we study the minimax estimation of the Bochner integral $$mu_k(P):=int_{mathcal{X}} k(cdot,x),dP(x),$$ also called as the kernel mean embedding, based on random samples drawn i.i.d.~from $P$, where $k:mathcal{X}timesmathcal{X}rightarrowmathbb{R}$ is a positive definite kernel. Various estimators (including the empirical estimator), $hat{theta}_n$ of $mu_k(P)$ are studied in the literature wherein all of them satisfy $bigl| hat{theta}_n-mu_k(P)bigr|_{mathcal{H}_k}=O_P(n^{-1/2})$ with $mathcal{H}_k$ being the reproducing kernel Hilbert space induced by $k$. The main contribution of the paper is in showing that the above mentioned rate of $n^{-1/2}$ is minimax in $|cdot|_{mathcal{H}_k}$ and $|cdot|_{L^2(mathbb{R}^d)}$-norms over the class of discrete measures and the class of measures that has an infinitely differentiable density, with $k$ being a continuous translation-invariant kernel on $mathbb{R}^d$. The interesting aspect of this result is that the minimax rate is independent of the smoothness of the kernel and the density of $P$ (if it exists). This result has practical consequences in statistical applications as the mean embedding has been widely employed in non-parametric hypothesis testing, density estimation, causal inference and feature selection, through its relation to energy distance (and distance covariance).","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"126 1","pages":"86:1-86:47"},"PeriodicalIF":0.0,"publicationDate":"2016-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90873170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a simple framework for estimating derivatives without _tting the regression function in nonparametric regression. Unlike most existing methods that use the symmetric difference quotients, our method is constructed as a linear combination of observations. It is hence very exible and applicable to both interior and boundary points, including most existing methods as special cases of ours. Within this framework, we define the variance-minimizing estimators for any order derivative of the regression function with a fixed bias-reduction level. For the equidistant design, we derive the asymptotic variance and bias of these estimators. We also show that our new method will, for the first time, achieve the asymptotically optimal convergence rate for difference-based estimators. Finally, we provide an effective criterion for selection of tuning parameters and demonstrate the usefulness of the proposed method through extensive simulation studies of the first-and second-order derivative estimators.
{"title":"Optimal Estimation of Derivatives in Nonparametric Regression","authors":"Wenlin Dai, T. Tong, M. Genton","doi":"10.5555/2946645.3053446","DOIUrl":"https://doi.org/10.5555/2946645.3053446","url":null,"abstract":"We propose a simple framework for estimating derivatives without _tting the regression function in nonparametric regression. Unlike most existing methods that use the symmetric difference quotients, our method is constructed as a linear combination of observations. It is hence very exible and applicable to both interior and boundary points, including most existing methods as special cases of ours. Within this framework, we define the variance-minimizing estimators for any order derivative of the regression function with a fixed bias-reduction level. For the equidistant design, we derive the asymptotic variance and bias of these estimators. We also show that our new method will, for the first time, achieve the asymptotically optimal convergence rate for difference-based estimators. Finally, we provide an effective criterion for selection of tuning parameters and demonstrate the usefulness of the proposed method through extensive simulation studies of the first-and second-order derivative estimators.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"48 1","pages":"164:1-164:25"},"PeriodicalIF":0.0,"publicationDate":"2016-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88020190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}