Alexandr Andoni, Robert Krauthgamer, Krzysztof Onak
We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor. For strings of length $n$ and every fixed $\epsilon>0$, the algorithm computes a $(\log n)^{O(1/\epsilon)}$ approximation in $n^{1+\epsilon}$ time. This is an \emph{exponential} improvement over the previously known approximation factor, $2^{\tilde{O}(\sqrt{\log n})}$, with a comparable running time [Ostrovsky and Rabani, J. ACM 2007; Andoni and Onak, STOC 2009]. This result arises naturally in the study of a new \emph{asymmetric query} model. In this model, the input consists of two strings $x$ and $y$, and an algorithm can access $y$ in an unrestricted manner, while being charged for querying every symbol of $x$. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being "repetitive", which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance.
{"title":"Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity","authors":"Alexandr Andoni, Robert Krauthgamer, Krzysztof Onak","doi":"10.1109/FOCS.2010.43","DOIUrl":"https://doi.org/10.1109/FOCS.2010.43","url":null,"abstract":"We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor. For strings of length $n$ and every fixed $eps>0$, the algorithm computes a $(log n)^{O(1/eps)}$ approximation in $n^{1+eps}$ time. This is an {em exponential} improvement over the previously known approximation factor, $2^{tilde O(sqrt{log n})}$, with a comparable running time [Ostrovsky and Rabani, J. ACM 2007, Andoni and Onak, STOC 2009]. This result arises naturally in the study of a new emph{asymmetric query} model. In this model, the input consists of two strings $x$ and $y$, and an algorithm can access $y$ in an unrestricted manner, while being charged for querying every symbol of $x$. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being ``repetitive'', which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115722193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we show how the complexity of performing nearest neighbor search (NNS) on a metric space is related to the expansion of the metric space. Given a metric space, we look at the graph obtained by connecting every pair of points within a certain distance $r$. We then look at various notions of expansion in this graph, relating them to the cell probe complexity of NNS for randomized and deterministic, exact and approximate algorithms. For example, if the graph has node expansion $\Phi$, then we show that any deterministic $t$-probe data structure for $n$ points must use space $S$ where $(St/n)^t > \Phi$. We show similar results for randomized algorithms as well. These relationships can be used to derive most of the known lower bounds in well-known metric spaces such as $\ell_1$, $\ell_2$, $\ell_\infty$, and some new ones, by simply computing their expansion. In the process, we strengthen and generalize our previous results~\cite{PTW08}. Additionally, we unify the approach in~\cite{PTW08} and the communication complexity based approach. Our work reduces the problem of proving cell probe lower bounds for near neighbor search to computing the appropriate expansion parameter. In our results, as in all previous results, the dependence on $t$ is weak, that is, the bound drops exponentially in $t$. We show a much stronger (tight) time-space tradeoff for the class of \emph{dynamic low contention} data structures. These are data structures that support updates to the data set and do not look up any single cell too often. A full version of the paper can be found in [19].
{"title":"Lower Bounds on Near Neighbor Search via Metric Expansion","authors":"R. Panigrahy, Kunal Talwar, Udi Wieder","doi":"10.1109/FOCS.2010.82","DOIUrl":"https://doi.org/10.1109/FOCS.2010.82","url":null,"abstract":"In this paper we show how the complexity of performing nearest neighbor (NNS) search on a metric space is related to the expansion of the metric space. Given a metric space we look at the graph obtained by connecting every pair of points within a certain distance $r$ . We then look at various notions of expansion in this graph relating them to the cell probe complexity of NNS for randomized and deterministic, exact and approximate algorithms. For example if the graph has node expansion $Phi$ then we show that any deterministic $t$-probe data structure for $n$ points must use space $S$ where $(St/n)^t > Phi$. We show similar results for randomized algorithms as well. These relationships can be used to derive most of the known lower bounds in the well known metric spaces such as $l_1$, $l_2$, $l_infty$, and some new ones, by simply computing their expansion. In the process, we strengthen and generalize our previous results~cite{PTW08}. Additionally, we unify the approach in~cite{PTW08} and the communication complexity based approach. Our work reduces the problem of proving cell probe lower bounds of near neighbor search to computing the appropriate expansion parameter. In our results, as in all previous results, the dependence on $t$ is weak, that is, the bound drops exponentially in $t$. We show a much stronger (tight) time-space tradeoff for the class of emph{dynamic} emph{low contention} data structures. These are data structures that supports updates in the data set and that do not look up any single cell too often. A full version of the paper could be found in [19].","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124595660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. Specifically, we show that parameters of a Gaussian mixture distribution with a fixed number of components can be learned using a sample whose size is polynomial in the dimension and all other parameters. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in the parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.
{"title":"Polynomial Learning of Distribution Families","authors":"M. Belkin, Kaushik Sinha","doi":"10.1137/13090818X","DOIUrl":"https://doi.org/10.1137/13090818X","url":null,"abstract":"The question of polynomial learn ability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learn ability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learn ability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. Specifically, we show that parameters of a Gaussian mixture distribution with fixed number of components can be learned using a sample whose size is polynomial in dimension and all other parameters. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call “polynomial families” in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127138423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We give an algorithm for this problem that has running time and data requirements polynomial in the dimension and the inverse of the desired accuracy, with provably minimal assumptions on the Gaussians. As a simple consequence of our learning algorithm, we give the first polynomial time algorithm for proper density estimation for mixtures of k Gaussians that needs no assumptions on the mixture. It was open whether proper density estimation was even statistically possible (with no assumptions) given only polynomially many samples, let alone whether it could be computationally efficient. The building blocks of our algorithm are based on the work of Kalai et al. (STOC 2010), which gives an efficient algorithm for learning mixtures of two Gaussians by considering a series of projections down to one dimension and applying the method of moments to each univariate projection. A major technical hurdle in the previous work is showing that one can efficiently learn univariate mixtures of two Gaussians. In contrast, because pathological scenarios can arise when considering projections of mixtures of more than two Gaussians, the bulk of the work in this paper concerns how to leverage a weaker algorithm for learning univariate mixtures (of many Gaussians) to learn in high dimensions. Our algorithm employs hierarchical clustering and rescaling, together with methods for backtracking and recovering from the failures that can occur in our univariate algorithm. Finally, while the running time and data requirements of our algorithm depend exponentially on the number of Gaussians in the mixture, we prove that such a dependence is necessary.
{"title":"Settling the Polynomial Learnability of Mixtures of Gaussians","authors":"Ankur Moitra, G. Valiant","doi":"10.1109/FOCS.2010.15","DOIUrl":"https://doi.org/10.1109/FOCS.2010.15","url":null,"abstract":"Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We give an algorithm for this problem that has running time and data requirements polynomial in the dimension and the inverse of the desired accuracy, with provably minimal assumptions on the Gaussians. As a simple consequence of our learning algorithm, we we give the first polynomial time algorithm for proper density estimation for mixtures of k Gaussians that needs no assumptions on the mixture. It was open whether proper density estimation was even statistically possible (with no assumptions) given only polynomially many samples, let alone whether it could be computationally efficient. The building blocks of our algorithm are based on the work (Kalai emph{et al}, STOC 2010) that gives an efficient algorithm for learning mixtures of two Gaussians by considering a series of projections down to one dimension, and applying the method of moments to each univariate projection. A major technical hurdle in the previous work is showing that one can efficiently learn univariate mixtures of two Gaussians. In contrast, because pathological scenarios can arise when considering projections of mixtures of more than two Gaussians, the bulk of the work in this paper concerns how to leverage a weaker algorithm for learning univariate mixtures (of many Gaussians) to learn in high dimensions. Our algorithm employs hierarchical clustering and rescaling, together with methods for backtracking and recovering from the failures that can occur in our univariate algorithm. Finally, while the running time and data requirements of our algorithm depend exponentially on the number of Gaussians in the mixture, we prove that such a dependence is necessary.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129517446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we consider coding schemes for computationally bounded channels, which can introduce an arbitrary set of errors as long as (a) the fraction of errors is bounded with high probability by a parameter p and (b) the process which adds the errors can be described by a sufficiently "simple" circuit. Codes for such channel models are attractive since, like codes for standard adversarial errors, they can handle channels whose true behavior is unknown or varying over time. For three classes of channels, we provide explicit, efficiently encodable/decodable codes of optimal rate where only inefficiently decodable codes were previously known. In each case, we provide one encoder/decoder that works for every channel in the class. Unique decoding for additive errors: We give the first construction of a poly-time encodable/decodable code for additive (a.k.a. oblivious) channels that achieves the Shannon capacity 1-H(p). List-decoding for online log-space channels: A space-S(N) bounded channel reads and modifies the transmitted codeword as a stream, using at most S(N) bits of workspace on transmissions of N bits. For constant S, this captures many models from the literature, including "discrete channels with finite memory" and "arbitrarily varying channels". We give an efficient code with optimal rate (arbitrarily close to 1-H(p)) that, for channels using at most O(log N) bits of workspace on transmissions of N bits, recovers with high probability a short list containing the correct message. List-decoding for poly-time channels: For any constant c we give a similar list-decoding result for channels describable by circuits of size at most N^c, assuming the existence of pseudorandom generators.
{"title":"Codes for Computationally Simple Channels: Explicit Constructions with Optimal Rate","authors":"V. Guruswami, Adam D. Smith","doi":"10.1109/FOCS.2010.74","DOIUrl":"https://doi.org/10.1109/FOCS.2010.74","url":null,"abstract":"In this paper, we consider coding schemes for computationally bounded channels, which can introduce an arbitrary set of errors as long as (a) the fraction of errors is bounded with high probability by a parameter p and (b) the process which adds the errors can be described by a sufficiently \"simple\" circuit. Codes for such channel models are attractive since, like codes for standard adversarial errors, they can handle channels whose true behavior is unknown or varying over time. For three classes of channels, we provide explicit, efficiently encodable/decodable codes of optimal rate where only inefficiently decodable codes were previously known. In each case, we provide one encoder/decoder that works for every channel in the class. Unique decoding for additive errors: We give the first construction of a poly-time encodable/decodable code for additive (a.k.a. oblivious) channels that achieve the Shannon capacity 1-H(p). List-decoding for online log-space channels: A space-S(N) bounded channel reads and modifies the transmitted codeword as a stream, using at most S(N) bits of workspace on transmissions of N bits. For constant S, this captures many models from the literature, including \"discrete channels with finite memory\" and \"arbitrarily varying channels\". We give an efficient code with optimal rate (arbitrarily close to 1-H(p)) that recovers a short list containing the correct message with high probability for channels which read and modify the transmitted codeword as a stream, using at most O(log N) bits of workspace on transmissions of N bits. List-decoding for poly-time channels: For any constant c we give a similar list-decoding result for channels describable by circuits of size at most N^c, assuming the existence of pseudorandom generators.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132877267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We give efficient algorithms for volume sampling, i.e., for picking $k$-subsets of the rows of any given matrix with probabilities proportional to the squared volumes of the simplices defined by them and the origin (or the squared volumes of the parallelepipeds defined by these subsets of rows). In other words, we can efficiently sample $k$-subsets of $[m]$ with probabilities proportional to the corresponding $k \times k$ principal minors of any given $m \times m$ positive semidefinite matrix. This solves an open problem from the monograph on spectral algorithms by Kannan and Vempala (see Section 7.4 of \cite{KV}, also implicit in \cite{BDM, DRVW}). Our first algorithm for volume sampling $k$-subsets of rows from an $m$-by-$n$ matrix runs in $O(kmn^{\omega} \log n)$ arithmetic operations (where $\omega$ is the exponent of matrix multiplication), and a second variant of it for $(1+\epsilon)$-approximate volume sampling runs in $O(mn \log m \cdot k^{2}/\epsilon^{2} + m \log^{\omega} m \cdot k^{2\omega+1}/\epsilon^{2\omega} \cdot \log(k \epsilon^{-1} \log m))$ arithmetic operations, which is almost linear in the size of the input (i.e., the number of entries) for small $k$. Our efficient volume sampling algorithms imply the following results for low-rank matrix approximation: (1) Given $A \in \mathbb{R}^{m \times n}$, in $O(kmn^{\omega} \log n)$ arithmetic operations we can find $k$ of its rows such that projecting onto their span gives a $\sqrt{k+1}$-approximation to the matrix of rank $k$ closest to $A$ under the Frobenius norm. This improves the $O(k \sqrt{\log k})$-approximation of Boutsidis, Drineas and Mahoney \cite{BDM} and matches the lower bound shown in \cite{DRVW}. The method of conditional expectations gives a \emph{deterministic} algorithm with the same complexity. The running time can be improved to $O(mn \log m \cdot k^{2}/\epsilon^{2} + m \log^{\omega} m \cdot k^{2\omega+1}/\epsilon^{2\omega} \cdot \log(k \epsilon^{-1} \log m))$ at the cost of losing an extra $(1+\epsilon)$ in the approximation factor. (2) The same rows and projection as in the previous point give a $\sqrt{(k+1)(n-k)}$-approximation to the matrix of rank $k$ closest to $A$ under the spectral norm. In this paper, we show an almost matching lower bound of $\sqrt{n}$, even for $k=1$.
{"title":"Efficient Volume Sampling for Row/Column Subset Selection","authors":"A. Deshpande, Luis Rademacher","doi":"10.1109/FOCS.2010.38","DOIUrl":"https://doi.org/10.1109/FOCS.2010.38","url":null,"abstract":"We give efficient algorithms for volume sampling, i.e., for picking $k$-subsets of the rows of any given matrix with probabilities proportional to the squared volumes of the simplices defined by them and the origin (or the squared volumes of the parallelepipeds defined by these subsets of rows). %In other words, we can efficiently sample $k$-subsets of $[m]$ with probabilities proportional to the corresponding $k$ by $k$ principal minors of any given $m$ by $m$ positive semi definite matrix. This solves an open problem from the monograph on spectral algorithms by Kannan and Vempala (see Section $7.4$ of cite{KV}, also implicit in cite{BDM, DRVW}). Our first algorithm for volume sampling $k$-subsets of rows from an $m$-by-$n$ matrix runs in $O(kmn^omega log n)$ arithmetic operations (where $omega$ is the exponent of matrix multiplication) and a second variant of it for $(1+eps)$-approximate volume sampling runs in $O(mn log m cdot k^{2}/eps^{2} + m log^{omega} m cdot k^{2omega+1}/eps^{2omega} cdot log(k eps^{-1} log m))$ arithmetic operations, which is almost linear in the size of the input (i.e., the number of entries) for small $k$. Our efficient volume sampling algorithms imply the following results for low-rank matrix approximation: (1) Given $A in reals^{m times n}$, in $O(kmn^{omega} log n)$ arithmetic operations we can find $k$ of its rows such that projecting onto their span gives a $sqrt{k+1}$-approximation to the matrix of rank $k$ closest to $A$ under the Frobenius norm. This improves the $O(k sqrt{log k})$-approximation of Boutsidis, Drineas and Mahoney cite{BDM} and matches the lower bound shown in cite{DRVW}. The method of conditional expectations gives a emph{deterministic} algorithm with the same complexity. The running time can be improved to $O(mn log m cdot k^{2}/eps^{2} + m log^{omega} m cdot k^{2omega+1}/eps^{2omega} cdot log(k eps^{-1} log m))$ at the cost of losing an extra $(1+eps)$ in the approximation factor. (2) The same rows and projection as in the previous point give a $sqrt{(k+1)(n-k)}$-approximation to the matrix of rank $k$ closest to $A$ under the spectral norm. In this paper, we show an almost matching lower bound of $sqrt{n}$, even for $k=1$.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122498802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amit Chakrabarti, Graham Cormode, Ranganath Kondapally, A. Mcgregor
This paper makes three main contributions to the theory of communication complexity and stream computation. First, we present new bounds on the information complexity of AUGMENTED-INDEX. In contrast to analogous results for INDEX by Jain, Radhakrishnan and Sen [J. ACM, 2009], we have to overcome the significant technical challenge that protocols for AUGMENTED-INDEX may violate the ``rectangle property'' due to the inherent input sharing. Second, we use these bounds to resolve an open problem of Magniez, Mathieu and Nayak [STOC, 2010] on the multi-pass complexity of recognizing Dyck languages. This results in a natural separation between the standard multi-pass model and the multi-pass model that permits reverse passes. Third, we present the first passive memory checkers that verify the interaction transcripts of priority queues, stacks, and double-ended queues. We obtain tight upper and lower bounds for these problems, thereby addressing an important sub-class of the memory checking framework of Blum et al. [Algorithmica, 1994].
{"title":"Information Cost Tradeoffs for Augmented Index and Streaming Language Recognition","authors":"Amit Chakrabarti, Graham Cormode, Ranganath Kondapally, A. Mcgregor","doi":"10.1109/FOCS.2010.44","DOIUrl":"https://doi.org/10.1109/FOCS.2010.44","url":null,"abstract":"This paper makes three main contributions to the theory of communication complexity and stream computation. First, we present new bounds on the information complexity of AUGMENTED-INDEX. In contrast to analogous results for INDEX by Jain, Radhakrishnan and Sen [J. ACM, 2009], we have to overcome the significant technical challenge that protocols for AUGMENTED-INDEX may violate the ``rectangle property'' due to the inherent input sharing. Second, we use these bounds to resolve an open problem of Magniez, Mathieu and Nayak [STOC, 2010] on the multi-pass complexity of recognizing Dyck languages. This results in a natural separation between the standard multi-pass model and the multi-pass model that permits reverse passes. Third, we present the first passive memory checkers that verify the interaction transcripts of priority queues, stacks, and double-ended queues. We obtain tight upper and lower bounds for these problems, thereby addressing an important sub-class of the memory checking framework of Blum et al. [Algorithmica, 1994].","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114240139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2010-04-13. DOI: 10.4086/toc.2012.v008a015
Ashish Goel, Ian Post
We study the single-sink buy-at-bulk problem with an unknown cost function. We wish to route flow from a set of demand nodes to a root node, where the cost of routing x total flow along an edge is proportional to f(x) for some concave, non-decreasing function f satisfying f(0)=0. We present a simple, fast, combinatorial algorithm that takes a set of demands and constructs a single tree T such that for all f the cost f(T) is a 47.45-approximation of the optimal cost for that f. This is within a factor of 2.33 of the best approximation ratio currently achievable when the tree can be optimized for a specific function. Trees achieving simultaneous O(1)-approximations for all concave functions were previously not known to exist regardless of computation time.
{"title":"One Tree Suffices: A Simultaneous O(1)-Approximation for Single-Sink Buy-at-Bulk","authors":"Ashish Goel, Ian Post","doi":"10.4086/toc.2012.v008a015","DOIUrl":"https://doi.org/10.4086/toc.2012.v008a015","url":null,"abstract":"We study the single-sink buy-at-bulk problem with an unknown cost function. We wish to route flow from a set of demand nodes to a root node, where the cost of routing x total flow along an edge is proportional to f(x) for some concave, non-decreasing function f satisfying f(0)=0. We present a simple, fast, combinatorial algorithm that takes a set of demands and constructs a single tree T such that for all f the cost f(T) is a 47.45-approximation of the optimal cost for that f. This is within a factor of 2.33 of the best approximation ratio currently achievable when the tree can be optimized for a specific function. Trees achieving simultaneous O(1)-approximations for all concave functions were previously not known to exist regardless of computation time.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114238332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There has been much progress on efficient algorithms for clustering data points generated by a mixture of k probability distributions under the assumption that the means of the distributions are well-separated, i.e., the distance between the means of any two distributions is at least Omega(k) standard deviations. These results generally make heavy use of the generative model and particular properties of the distributions. In this paper, we show that a simple clustering algorithm works without assuming any generative (probabilistic) model. Our only assumption is what we call a "proximity condition": the projection of any data point onto the line joining its cluster center to any other cluster center is Omega(k) standard deviations closer to its own center than to the other center. Here the notion of standard deviations is based on the spectral norm of the matrix whose rows represent the difference between a point and the mean of the cluster to which it belongs. We show that in the generative models studied, our proximity condition is satisfied, and so we are able to derive most known results for generative models as corollaries of our main result. We also prove some new results for generative models - e.g., we can cluster all but a small fraction of points only assuming a bound on the variance. Our algorithm relies on the well-known k-means algorithm, and along the way, we prove a result of independent interest: the k-means algorithm converges to the "true centers" even in the presence of spurious points, provided the initial (estimated) centers are close enough to the corresponding actual centers and all but a small fraction of the points satisfy the proximity condition. Finally, we present a new technique for boosting the ratio of inter-center separation to standard deviation. This allows us to prove results for learning certain mixtures of distributions under weaker separation conditions.
{"title":"Clustering with Spectral Norm and the k-Means Algorithm","authors":"Amit Kumar, R. Kannan","doi":"10.1109/FOCS.2010.35","DOIUrl":"https://doi.org/10.1109/FOCS.2010.35","url":null,"abstract":"There has been much progress on efficient algorithms for clustering data points generated by a mixture of k probability distributions under the assumption that the means of the distributions are well-separated, i.e., the distance between the means of any two distributions is at least Omega(k) standard deviations. These results generally make heavy use of the generative model and particular properties of the distributions. In this paper, we show that a simple clustering algorithm works without assuming any generative (probabilistic) model. Our only assumption is what we call a \"proximity condition'': the projection of any data point onto the line joining its cluster center to any other cluster center is Omega(k) standard deviations closer to its own center than the other center. Here the notion of standard deviations is based on the spectral norm of the matrix whose rows represent the difference between a point and the mean of the cluster to which it belongs. We show that in the generative models studied, our proximity condition is satisfied and so we are able to derive most known results for generative models as corollaries of our main result. We also prove some new results for generative models - e.g., we can cluster all but a small fraction of points only assuming a bound on the variance. Our algorithm relies on the well known k-means algorithm, and along the way, we prove a result of independent interest – that the k-means algorithm converges to the \"true centers'' even in the presence of spurious points provided the initial (estimated) centers are close enough to the corresponding actual centers and all but a small fraction of the points satisfy the proximity condition. Finally, we present a new technique for boosting the ratio of inter-center separation to standard deviation. This allows us to prove results for learning certain mixture of distributions under weaker separation conditions.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131593020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
It has been shown by Indyk and Sidiropoulos \cite{indyk_genus} that any graph of genus $g>0$ can be stochastically embedded into a distribution over planar graphs with distortion $2^{O(g)}$. This bound was later improved to $O(g^2)$ by Borradaile, Lee and Sidiropoulos \cite{BLS09}. We give an embedding with distortion $O(\log g)$, which is asymptotically optimal. Apart from the improved distortion, another advantage of our embedding is that it can be computed in polynomial time. In contrast, the algorithm of \cite{BLS09} requires solving an NP-hard problem. Our result implies in particular a reduction for a large class of geometric optimization problems from instances on genus-$g$ graphs to corresponding ones on planar graphs, with an $O(\log g)$ loss factor in the approximation guarantee.
{"title":"Optimal Stochastic Planarization","authors":"Anastasios Sidiropoulos","doi":"10.1109/FOCS.2010.23","DOIUrl":"https://doi.org/10.1109/FOCS.2010.23","url":null,"abstract":"It has been shown by Indyk and Sidiropoulos cite{indyk_genus} that any graph of genus $g>0$ can be stochastically embedded into a distribution over planar graphs with distortion $2^{O(g)}$. This bound was later improved to $O(g^2)$ by Borradaile, Lee and Sidiropoulos cite{BLS09}. We give an embedding with distortion $O(log g)$, which is asymptotically optimal. Apart from the improved distortion, another advantage of our embedding is that it can be computed in polynomial time. In contrast, the algorithm of cite{BLS09} requires solving an NP-hard problem. Our result implies in particular a reduction for a large class of geometric optimization problems from instances on genus-$g$ graphs, to corresponding ones on planar graphs, with a $O(log g)$ loss factor in the approximation guarantee.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115004082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}