The problem of approximately computing the k dominant Fourier coefficients of a vector X quickly, and using few samples in the time domain, is known as the Sparse Fourier Transform (sparse FFT) problem. A long line of work on the sparse FFT has resulted in algorithms with O(k log n log(n/k)) runtime [Hassanieh et al., STOC'12] and O(k log n) sample complexity [Indyk et al., FOCS'14]. This paper revisits the sparse FFT problem with the added twist that the sparse coefficients approximately obey a (k0, k1)-block sparse model. In this model, signal frequencies are clustered in k0 intervals of width k1 in Fourier space, and k = k0 k1 is the total sparsity. Our main result is the first sparse FFT algorithm for (k0, k1)-block sparse signals with a sample complexity of O*(k0 k1 + k0 log(1 + k0) log n) at constant signal-to-noise ratios, and sublinear runtime. Our algorithm crucially uses adaptivity to achieve the improved sample complexity bound, and we provide a lower bound showing that this is essential in the Fourier setting: any non-adaptive algorithm must use Ω(k0 k1 log(n/(k0 k1))) samples for the (k0, k1)-block sparse model, ruling out improvements over the vanilla sparsity assumption. Our main technical innovation for adaptivity is a new randomized energy-based importance sampling technique that may be of independent interest.
{"title":"An adaptive sublinear-time block sparse fourier transform","authors":"V. Cevher, M. Kapralov, J. Scarlett, A. Zandieh","doi":"10.1145/3055399.3055462","DOIUrl":"https://doi.org/10.1145/3055399.3055462","url":null,"abstract":"The problem of approximately computing the k dominant Fourier coefficients of a vector X quickly, and using few samples in time domain, is known as the Sparse Fourier Transform (sparse FFT) problem. A long line of work on the sparse FFT has resulted in algorithms with O(klognlog(n/k)) runtime [Hassanieh et al., STOC'12] and O(klogn) sample complexity [Indyk et al., FOCS'14]. This paper revisits the sparse FFT problem with the added twist that the sparse coefficients approximately obey a (k0,k1)-block sparse model. In this model, signal frequencies are clustered in k0 intervals with width k1 in Fourier space, and k= k0k1 is the total sparsity. Our main result is the first sparse FFT algorithm for (k0, k1)-block sparse signals with a sample complexity of O*(k0k1 + k0log(1+ k0)logn) at constant signal-to-noise ratios, and sublinear runtime. Our algorithm crucially uses adaptivity to achieve the improved sample complexity bound, and we provide a lower bound showing that this is essential in the Fourier setting: Any non-adaptive algorithm must use Ω(k0k1logn/k0k1) samples for the (k0,k1)-block sparse model, ruling out improvements over the vanilla sparsity assumption. Our main technical innovation for adaptivity is a new randomized energy-based importance sampling technique that may be of independent interest.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89031162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The field of algorithmic self-assembly is concerned with the computational and expressive power of nanoscale self-assembling molecular systems. In the well-studied cooperative, or temperature 2, abstract tile assembly model it is known that there is a tile set to simulate any Turing machine and an intrinsically universal tile set that simulates the shapes and dynamics of any instance of the model, up to spatial rescaling. It has been an open question whether the seemingly simpler noncooperative, or temperature 1, model is capable of such behaviour. Here we resolve this question negatively: there is no tile set in the noncooperative model that is intrinsically universal, nor one capable of time-bounded Turing machine simulation within a bounded region of the plane. Although the noncooperative model intuitively seems to lack the complexity and power of the cooperative model, it has been exceedingly hard to prove this. One reason is that there have been few tools to analyse the structure of complicated paths in the plane. This paper provides a number of such tools. A second reason is that almost every obvious and small generalisation of the model (e.g. allowing error, 3D, non-square tiles, signals/wires on tiles, tiles that repel each other, parallel synchronous growth) endows it with great computational, and sometimes simulation, power. Our main results show that all of these generalisations provably increase computational and/or simulation power. Our results hold for both deterministic and nondeterministic noncooperative systems. Our first main result stands in stark contrast with the fact that for both the cooperative tile assembly model and for 3D noncooperative tile assembly, there are respective intrinsically universal tile sets. Our second main result gives a new technique (reduction to simulation) for proving negative results about computation in tile assembly.
{"title":"The non-cooperative tile assembly model is not intrinsically universal or capable of bounded Turing machine simulation","authors":"Pierre-Etienne Meunier, D. Woods","doi":"10.1145/3055399.3055446","DOIUrl":"https://doi.org/10.1145/3055399.3055446","url":null,"abstract":"The field of algorithmic self-assembly is concerned with the computational and expressive power of nanoscale self-assembling molecular systems. In the well-studied cooperative, or temperature 2, abstract tile assembly model it is known that there is a tile set to simulate any Turing machine and an intrinsically universal tile set that simulates the shapes and dynamics of any instance of the model, up to spatial rescaling. It has been an open question as to whether the seemingly simpler noncooperative, or temperature 1, model is capable of such behaviour. Here we show that this is not the case by showing that there is no tile set in the noncooperative model that is intrinsically universal, nor one capable of time-bounded Turing machine simulation within a bounded region of the plane. Although the noncooperative model intuitively seems to lack the complexity and power of the cooperative model it has been exceedingly hard to prove this. One reason is that there have been few tools to analyse the structure of complicated paths in the plane. This paper provides a number of such tools. A second reason is that almost every obvious and small generalisation to the model (e.g. allowing error, 3D, non-square tiles, signals/wires on tiles, tiles that repel each other, parallel synchronous growth) endows it with great computational, and sometimes simulation, power. Our main results show that all of these generalisations provably increase computational and/or simulation power. Our results hold for both deterministic and nondeterministic noncooperative systems. Our first main result stands in stark contrast with the fact that for both the cooperative tile assembly model, and for 3D noncooperative tile assembly, there are respective intrinsically universal tilesets. Our second main result gives a new technique (reduction to simulation) for proving negative results about computation in tile assembly.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80541816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For every constant ε > 0, we give an exp(Õ(√n))-time algorithm for the 1 vs 1 - ε Best Separable State (BSS) problem of distinguishing, given an n^2 × n^2 matrix ℳ corresponding to a quantum measurement, between the case that there is a separable (i.e., non-entangled) state ρ that ℳ accepts with probability 1, and the case that every separable state is accepted with probability at most 1 - ε. Equivalently, our algorithm takes the description of a subspace 𝒲 ⊆ 𝔽^{n^2} (where 𝔽 can be either the real or complex field) and distinguishes between the case that 𝒲 contains a rank-one matrix, and the case that every rank-one matrix is at least ε-far (in ℓ2 distance) from 𝒲. To the best of our knowledge, this is the first improvement over the brute-force exp(n)-time algorithm for this problem. Our algorithm is based on the sum-of-squares hierarchy and its analysis is inspired by Lovett's proof (STOC '14, JACM '16) that the communication complexity of every rank-n Boolean matrix is bounded by Õ(√n).
{"title":"Quantum entanglement, sum of squares, and the log rank conjecture","authors":"B. Barak, Pravesh Kothari, David Steurer","doi":"10.1145/3055399.3055488","DOIUrl":"https://doi.org/10.1145/3055399.3055488","url":null,"abstract":"For every constant ε>0, we give an exp(Õ(∞n))-time algorithm for the 1 vs 1 - ε Best Separable State (BSS) problem of distinguishing, given an n2 x n2 matrix ℳ corresponding to a quantum measurement, between the case that there is a separable (i.e., non-entangled) state ρ that ℳ accepts with probability 1, and the case that every separable state is accepted with probability at most 1 - ε. Equivalently, our algorithm takes the description of a subspace 𝒲 ⊆ 𝔽n2 (where 𝔽 can be either the real or complex field) and distinguishes between the case that contains a rank one matrix, and the case that every rank one matrix is at least ε far (in 𝓁2 distance) from 𝒲. To the best of our knowledge, this is the first improvement over the brute-force exp(n)-time algorithm for this problem. Our algorithm is based on the sum-of-squares hierarchy and its analysis is inspired by Lovett's proof (STOC '14, JACM '16) that the communication complexity of every rank-n Boolean matrix is bounded by Õ(√n).","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"92 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83789449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We formalize a framework of algebraically natural lower bounds for algebraic circuits. Just as with the natural proofs notion of Razborov and Rudich for Boolean circuit lower bounds, our notion of algebraically natural lower bounds captures nearly all lower bound techniques known. However, unlike the Boolean setting, there has been no concrete evidence demonstrating that this is a barrier to obtaining super-polynomial lower bounds for general algebraic circuits, as there is little understanding of whether algebraic circuits are expressive enough to support "cryptography" secure against algebraic circuits. Following a similar result of Williams in the Boolean setting, we show that the existence of an algebraic natural proofs barrier is equivalent to the existence of succinct derandomization of the polynomial identity testing problem; that is, to whether the coefficient vectors of polylog(N)-degree, polylog(N)-size circuits form a hitting set for the class of poly(N)-degree, poly(N)-size circuits. Further, we give an explicit universal construction showing that if such a succinct hitting set exists, then our universal construction suffices. We also assess the existing literature constructing hitting sets for restricted classes of algebraic circuits and observe that none of them are succinct as given. Yet, we show how to modify some of these constructions to obtain succinct hitting sets. This constitutes the first evidence supporting the existence of an algebraic natural proofs barrier. Our framework is similar to the Geometric Complexity Theory (GCT) program of Mulmuley and Sohoni, except that here we emphasize constructiveness of the proofs while the GCT program emphasizes symmetry. Nevertheless, our succinct hitting sets have relevance to the GCT program, as they imply lower bounds for the complexity of the defining equations of polynomials computed by small circuits.
{"title":"Succinct hitting sets and barriers to proving algebraic circuits lower bounds","authors":"Michael A. Forbes, Amir Shpilka, Ben lee Volk","doi":"10.1145/3055399.3055496","DOIUrl":"https://doi.org/10.1145/3055399.3055496","url":null,"abstract":"We formalize a framework of algebraically natural lower bounds for algebraic circuits. Just as with the natural proofs notion of Razborov and Rudich for boolean circuit lower bounds, our notion of algebraically natural lower bounds captures nearly all lower bound techniques known. However, unlike the boolean setting, there has been no concrete evidence demonstrating that this is a barrier to obtaining super-polynomial lower bounds for general algebraic circuits, as there is little understanding whether algebraic circuits are expressive enough to support \"cryptography\" secure against algebraic circuits. Following a similar result of Williams in the boolean setting, we show that the existence of an algebraic natural proofs barrier is equivalent to the existence of succinct derandomization of the polynomial identity testing problem. That is, whether the coefficient vectors of polylog(N)-degree polylog(N)-size circuits is a hitting set for the class of poly(N)-degree poly(N)-size circuits. Further, we give an explicit universal construction showing that if such a succinct hitting set exists, then our universal construction suffices. Further, we assess the existing literature constructing hitting sets for restricted classes of algebraic circuits and observe that none of them are succinct as given. Yet, we show how to modify some of these constructions to obtain succinct hitting sets. This constitutes the first evidence supporting the existence of an algebraic natural proofs barrier. Our framework is similar to the Geometric Complexity Theory (GCT) program of Mulmuley and Sohoni, except that here we emphasize constructiveness of the proofs while the GCT program emphasizes symmetry. Nevertheless, our succinct hitting sets have relevance to the GCT program as they imply lower bounds for the complexity of the defining equations of polynomials computed by small circuits.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79161560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Let P : {0,1}^k → {0,1} be a nontrivial k-ary predicate. Consider a random instance of the constraint satisfaction problem CSP(P) on n variables with Δn constraints, each being P applied to k randomly chosen literals. Provided the constraint density satisfies Δ ≫ 1, such an instance is unsatisfiable with high probability. The refutation problem is to efficiently find a proof of unsatisfiability. We show that whenever the predicate P supports a t-wise uniform probability distribution on its satisfying assignments, the sum-of-squares (SOS) algorithm of degree d = Θ(n / (Δ^{2/(t-1)} log Δ)) (which runs in time n^{O(d)}) cannot refute a random instance of CSP(P). In particular, the polynomial-time SOS algorithm requires Ω(n^{(t+1)/2}) constraints to refute random instances of CSP(P) when P supports a t-wise uniform distribution on its satisfying assignments. Together with recent work of Lee, Raghavendra, and Steurer (2015), our result also implies that any polynomial-size semidefinite programming relaxation for refutation requires at least Ω(n^{(t+1)/2}) constraints. More generally, we consider the δ-refutation problem, in which the goal is to certify that at most a (1 - δ)-fraction of constraints can be simultaneously satisfied. We show that if P is δ-close to supporting a t-wise uniform distribution on satisfying assignments, then the degree-Θ(n / (Δ^{2/(t-1)} log Δ)) SOS algorithm cannot (δ + o(1))-refute a random instance of CSP(P). This is the first result to show a distinction between the degree SOS needs to solve the refutation problem and the degree it needs to solve the harder δ-refutation problem. Our results (which also extend with no change to CSPs over larger alphabets) subsume all previously known lower bounds for semialgebraic refutation of random CSPs. For every constraint predicate P, they give a three-way hardness tradeoff between the density of constraints, the SOS degree (and hence running time), and the strength of the refutation. By recent algorithmic results of Allen, O'Donnell, and Witmer (2015) and Raghavendra, Rao, and Schramm (2016), this full three-way tradeoff is tight, up to lower-order factors.
{"title":"Sum of squares lower bounds for refuting any CSP","authors":"Pravesh Kothari, R. Mori, R. O'Donnell, David Witmer","doi":"10.1145/3055399.3055485","DOIUrl":"https://doi.org/10.1145/3055399.3055485","url":null,"abstract":"Let P:{0,1}k → {0,1} be a nontrivial k-ary predicate. Consider a random instance of the constraint satisfaction problem (P) on n variables with Δ n constraints, each being P applied to k randomly chosen literals. Provided the constraint density satisfies Δ ≫ 1, such an instance is unsatisfiable with high probability. The refutation problem is to efficiently find a proof of unsatisfiability. We show that whenever the predicate P supports a t-wise uniform probability distribution on its satisfying assignments, the sum of squares (SOS) algorithm of degree d = Θ(n/Δ2/(t-1) logΔ) (which runs in time nO(d)) cannot refute a random instance of (P). In particular, the polynomial-time SOS algorithm requires Ω(n(t+1)/2) constraints to refute random instances of CSP(P) when P supports a t-wise uniform distribution on its satisfying assignments. Together with recent work of Lee, Raghavendra, Steurer (2015), our result also implies that any polynomial-size semidefinite programming relaxation for refutation requires at least Ω(n(t+1)/2) constraints. More generally, we consider the δ-refutation problem, in which the goal is to certify that at most a (1 - δ)-fraction of constraints can be simultaneously satisfied. We show that if P is δ-close to supporting a t-wise uniform distribution on satisfying assignments, then the degree-Θ(n/Δ2/(t - 1) logΔ) SOS algorithm cannot (δ+o(1))-refute a random instance of CSP(P). This is the first result to show a distinction between the degree SOS needs to solve the refutation problem and the degree it needs to solve the harder δ-refutation problem. Our results (which also extend with no change to CSPs over larger alphabets) subsume all previously known lower bounds for semialgebraic refutation of random CSPs. For every constraint predicate P, they give a three-way hardness tradeoff between the density of constraints, the SOS degree (hence running time), and the strength of the refutation. By recent algorithmic results of Allen, O'Donnell, Witmer (2015) and Raghavendra, Rao, Schramm (2016), this full three-way tradeoff is tight, up to lower-order factors.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2017-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88340091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (i.e., the coordinates of a given data point) are explained as a probabilistic function of some hidden variables. Learning the model, that is, the mapping from hidden variables to visible ones and vice versa, is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models with linear structure: topic models, mixture models, hidden Markov models, etc. These algorithms use matrix or tensor decomposition, and make some reasonable assumptions about the parameters of the underlying model. But matrix or tensor decomposition seems of little use when the latent variable model has nonlinearities. The current paper shows how to make progress: tensor decomposition is applied to learning the single-layer noisy-OR network, which is a textbook example of a Bayes net, used for example in the classic QMR-DT software for diagnosing which disease(s) a patient may have from the symptoms they exhibit. The technical novelty here, which should be useful in other settings in the future, is an analysis of tensor decomposition in the presence of systematic error (i.e., where the noise/error is correlated with the signal, and does not decrease as the number of samples goes to infinity). This requires rethinking all steps of tensor decomposition methods from the ground up. For simplicity, our analysis is stated assuming that the network parameters were chosen from a probability distribution, but the method seems more generally applicable.
{"title":"Provable learning of noisy-OR networks","authors":"Sanjeev Arora, Rong Ge, Tengyu Ma, Andrej Risteski","doi":"10.1145/3055399.3055482","DOIUrl":"https://doi.org/10.1145/3055399.3055482","url":null,"abstract":"Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Learning the model ---that is, the mapping from hidden variables to visible ones and vice versa---is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models with linear structure: topic models, mixture models, hidden markov models, etc. These algorithms use matrix or tensor decomposition, and make some reasonable assumptions about the parameters of the underlying model. But matrix or tensor decomposition seems of little use when the latent variable model has nonlinearities. The current paper shows how to make progress: tensor decomposition is applied for learning the single-layer noisy-OR network, which is a textbook example of a bayes net, and used for example in the classic QMR-DT software for diagnosing which disease(s) a patient may have by observing the symptoms he/she exhibits. The technical novelty here, which should be useful in other settings in future, is analysis of tensor decomposition in presence of systematic error (i.e., where the noise/error is correlated with the signal, and doesn't decrease as number of samples goes to infinity). This requires rethinking all steps of tensor decomposition methods from the ground up. For simplicity our analysis is stated assuming that the network parameters were chosen from a probability distribution but the method seems more generally applicable.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86500393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x ∈ P with B(q, x) ≥ b1, then we can efficiently return x′ ∈ P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n^{1+ρ} log n + Σ_{x ∈ P} |x|) and query time O(|q| n^ρ log n), where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014), we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size, our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998), such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
{"title":"Set similarity search beyond MinHash","authors":"Tobias Christiani, R. Pagh","doi":"10.1145/3055399.3055443","DOIUrl":"https://doi.org/10.1145/3055399.3055443","url":null,"abstract":"We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b1, then we can efficiently return x′ Ε P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x ε P|x|) and query time O(|q|nρ logn) where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90980984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the trace reconstruction problem, an unknown bit string x ∈ {0,1}^n is observed through the deletion channel, which deletes each bit of x with some constant probability q, yielding a contracted string x̃. How many independent copies of x̃ are needed to reconstruct x with high probability? Prior to this work, the best upper bound, due to Holenstein, Mitzenmacher, Panigrahy, and Wieder (2008), was exp(O(n^{1/2})). We improve this bound to exp(O(n^{1/3})) using statistics of individual bits in the output, and show that this bound is sharp in the restricted model where this is the only information used. Our method, which uses elementary complex analysis, can also handle insertions. Similar results were obtained independently and simultaneously by Anindya De, Ryan O'Donnell, and Rocco Servedio.
{"title":"Trace reconstruction with exp(O(n1/3)) samples","authors":"F. Nazarov, Y. Peres","doi":"10.1145/3055399.3055494","DOIUrl":"https://doi.org/10.1145/3055399.3055494","url":null,"abstract":"In the trace reconstruction problem, an unknown bit string x ∈ {0,1}n is observed through the deletion channel, which deletes each bit of x with some constant probability q, yielding a contracted string x. How many independent copies of x are needed to reconstruct x with high probability? Prior to this work, the best upper bound, due to Holenstein, Mitzenmacher, Panigrahy, and Wieder (2008), was exp(O(n1/2)). We improve this bound to exp(O(n1/3)) using statistics of individual bits in the output and show that this bound is sharp in the restricted model where this is the only information used. Our method, that uses elementary complex analysis, can also handle insertions. Similar results were obtained independently and simultaneously by Anindya De, Ryan O'Donnell and Rocco Servedio.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81864643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the (deletion-channel) trace reconstruction problem, there is an unknown n-bit source string x. An algorithm is given access to independent traces of x, where a trace is formed by deleting each bit of x independently with probability δ. The goal of the algorithm is to recover x exactly (with high probability), while minimizing samples (number of traces) and running time. Previously, the best known algorithm for the trace reconstruction problem was due to Holenstein et al. [SODA 2008]; it uses exp(O(n^{1/2})) samples and running time for any fixed 0 < δ < 1. It is also what we call a "mean-based algorithm", meaning that it only uses the empirical means of the individual bits of the traces. Holenstein et al. also gave a lower bound, showing that any mean-based algorithm must use at least n^{Ω(log n)} samples. In this paper we improve both of these results, obtaining matching upper and lower bounds for mean-based trace reconstruction. For any constant deletion rate 0 < δ < 1, we give a mean-based algorithm that uses exp(O(n^{1/3})) time and traces; we also prove that any mean-based algorithm must use at least exp(Ω(n^{1/3})) traces. In fact, we obtain matching upper and lower bounds even for δ subconstant and ρ := 1 - δ subconstant: when (log^3 n)/n ≪ δ ≤ 1/2 the bound is exp(Θ(δn)^{1/3}), and when 1/√n ≪ ρ ≤ 1/2 the bound is exp(Θ(n/ρ)^{1/3}). Our proofs involve estimates for the maxima of Littlewood polynomials on complex disks. We show that these techniques can also be used to perform trace reconstruction with random insertions and bit-flips in addition to deletions. We also find a surprising result: for deletion probabilities δ > 1/2, the presence of insertions can actually help with trace reconstruction.
{"title":"Optimal mean-based algorithms for trace reconstruction","authors":"Anindya De, R. O'Donnell, R. Servedio","doi":"10.1145/3055399.3055450","DOIUrl":"https://doi.org/10.1145/3055399.3055450","url":null,"abstract":"In the (deletion-channel) trace reconstruction problem, there is an unknown n-bit source string x. An algorithm is given access to independent traces of x, where a trace is formed by deleting each bit of x independently with probability δ. The goal of the algorithm is to recover x exactly (with high probability), while minimizing samples (number of traces) and running time. Previously, the best known algorithm for the trace reconstruction problem was due to Holenstein et al. [SODA 2008]; it uses exp(O(n1/2)) samples and running time for any fixed 0 < δ < 1. It is also what we call a \"mean-based algorithm\", meaning that it only uses the empirical means of the individual bits of the traces. Holenstein et al. also gave a lower bound, showing that any mean-based algorithm must use at least nΩ(logn) samples. In this paper we improve both of these results, obtaining matching upper and lower bounds for mean-based trace reconstruction. For any constant deletion rate 0 < Ω < 1, we give a mean-based algorithm that uses exp(O(n1/3)) time and traces; we also prove that any mean-based algorithm must use at least exp(Ω(n1/3)) traces. In fact, we obtain matching upper and lower bounds even for Ω subconstant and ρ := 1 - Ω subconstant: when (log3 n)/n ≪ Ω ≤ 1/2 the bound is exp(-Θ(δδ n)1/3), and when 1/√n ≪ ρ ≥ 1/2 the bound is exp(-Θ(n/Θ)1/3). Our proofs involve estimates for the maxima of Littlewood polynomials on complex disks. We show that these techniques can also be used to perform trace reconstruction with random insertions and bit-flips in addition to deletions. We also find a surprising result: for deletion probabilities δ > 1/2, the presence of insertions can actually help with trace reconstruction.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74024974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Set functions with convenient properties (such as submodularity) appear in application areas of current interest, such as algorithmic game theory, and allow for improved optimization algorithms. It is natural to ask (e.g., in the context of data-driven optimization) how robust such properties are, and whether small deviations from them can be tolerated. We consider two such questions in the important special case of linear set functions. One question that we address is whether any set function that approximately satisfies the modularity equation (linear functions satisfy the modularity equation exactly) is close to a linear function. The answer to this is positive (in a precise formal sense), as shown by Kalton and Roberts [1983] (and further improved by Bondarenko, Prymak, and Radchenko [2013]). We revisit their proof idea, which is based on expander graphs, and provide significantly stronger upper bounds by combining it with new techniques. Furthermore, we provide improved lower bounds for this problem. Another question that we address is how to learn a linear function h that is close to an approximately linear function f, while querying the value of f on only a small number of sets. We present a deterministic algorithm that makes only linearly many (in the number of items) nonadaptive queries, thereby improving over a previous algorithm of Chierichetti, Das, Dasgupta, and Kumar [2015], which is randomized and makes more than a quadratic number of queries. Our learning algorithm is based on a Hadamard transform.
{"title":"Approximate modularity revisited","authors":"U. Feige, M. Feldman, Inbal Talgam-Cohen","doi":"10.1145/3055399.3055476","DOIUrl":"https://doi.org/10.1145/3055399.3055476","url":null,"abstract":"Set functions with convenient properties (such as submodularity) appear in application areas of current interest, such as algorithmic game theory, and allow for improved optimization algorithms. It is natural to ask (e.g., in the context of data driven optimization) how robust such properties are, and whether small deviations from them can be tolerated. We consider two such questions in the important special case of linear set functions. One question that we address is whether any set function that approximately satisfies the modularity equation (linear functions satisfy the modularity equation exactly) is close to a linear function. The answer to this is positive (in a precise formal sense) as shown by Kalton and Roberts [1983] (and further improved by Bondarenko, Prymak, and Radchenko [2013]). We revisit their proof idea that is based on expander graphs, and provide significantly stronger upper bounds by combining it with new techniques. Furthermore, we provide improved lower bounds for this problem. Another question that we address is that of how to learn a linear function h that is close to an approximately linear function f, while querying the value of f on only a small number of sets. We present a deterministic algorithm that makes only linearly many (in the number of items) nonadaptive queries, by this improving over a previous algorithm of Chierichetti, Das, Dasgupta and Kumar [2015] that is randomized and makes more than a quadratic number of queries. Our learning algorithm is based on a Hadamard transform.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84147076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}