Title: Generalization error bounds for iterative recovery algorithms unfolded as neural networks
Authors: Ekkehard Schnoor, Arash Behboodi, Holger Rauhut
DOI: 10.1093/imaiai/iaad023
Published: 2023-04-27

Abstract: Motivated by the learned iterative soft thresholding algorithm (LISTA), we introduce a general class of neural networks suitable for sparse reconstruction from few linear measurements. By allowing a wide range of degrees of weight-sharing between the layers, we enable a unified analysis for very different neural network types, ranging from recurrent ones to networks more similar to standard feedforward neural networks. Based on training samples, via empirical risk minimization, we aim at learning the optimal network parameters and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks that also take the thresholding parameters into account. We obtain estimates of the sample complexity that essentially depend only linearly on the number of parameters and on the depth. We apply our main result to obtain specific generalization bounds for several practical examples, including different algorithms for (implicit) dictionary learning and convolutional neural networks.
Title: Separation-free super-resolution from compressed measurements is possible: an orthonormal atomic norm minimization approach
Authors: Jirong Yi, Soura Dasgupta, Jian-Feng Cai, Mathews Jacob, Jingchao Gao, Myung Cho, Weiyu Xu
DOI: 10.1093/imaiai/iaad033
Published: 2023-04-27

Abstract: We consider the problem of recovering the superposition of $R$ distinct complex exponential functions from compressed non-uniform time-domain samples. Total variation (TV) minimization and atomic norm minimization have been proposed in the literature to recover the $R$ frequencies or the missing data. However, it is known that for TV minimization and atomic norm minimization to recover the missing data or the frequencies, the underlying $R$ frequencies must be well separated, even when the measurements are noiseless. This paper shows that the Hankel matrix recovery approach can super-resolve the $R$ complex exponentials and their frequencies from compressed non-uniform measurements, regardless of how close their frequencies are to each other. We propose a new concept of orthonormal atomic norm minimization (OANM) and demonstrate that the success of Hankel matrix recovery in separation-free super-resolution comes from the fact that the nuclear norm of a Hankel matrix is an orthonormal atomic norm. More specifically, we show that, in traditional atomic norm minimization, the underlying parameter values must be well separated to achieve successful signal recovery if the atoms change continuously with respect to the continuously valued parameter. In contrast, OANM can succeed even when the original atoms are arbitrarily close. As a byproduct of this research, we provide a matrix-theoretic inequality for the nuclear norm and prove it using the theory of compressed sensing.
Title: Minimax detection of localized signals in statistical inverse problems
Authors: Markus Pohlmann, Frank Werner, Axel Munk
DOI: 10.1093/imaiai/iaad026
Published: 2023-04-27

Abstract: We investigate minimax testing for detecting local signals or linear combinations of such signals when only indirect data are available. Naturally, in the presence of noise, signals that are too small cannot be reliably detected. In a Gaussian white noise model, we discuss upper and lower bounds for the minimal size of the signal such that testing with small error probabilities is possible. In certain situations we are able to characterize the asymptotic minimax detection boundary. Our results are applied to inverse problems such as numerical differentiation, deconvolution and the inversion of the Radon transform.
Title: Sharp, strong and unique minimizers for low complexity robust recovery
Authors: Jalal Fadili, Tran T. A. Nghia, Trinh T. T. Tran
DOI: 10.1093/imaiai/iaad005
Published: 2023-04-23

Abstract: In this paper, we show the important roles of sharp minima and strong minima for robust recovery. We also obtain several characterizations of sharp minima for convex regularized optimization problems. Our characterizations are quantitative and verifiable, especially in the case of decomposable norm regularized problems, including sparsity, group-sparsity and low-rank convex problems. For group-sparsity optimization problems, we show that a unique solution is a strong solution and obtain quantitative characterizations of solution uniqueness.
Title: Non-adaptive algorithms for threshold group testing with consecutive positives
DOI: 10.1093/imaiai/iaad009
Published: 2023-04-04

Abstract: Given up to $d$ positive items in a large population of $n$ items ($d \ll n$), the goal of threshold group testing is to efficiently identify the positives via tests, where a test on a subset of items is positive if the subset contains at least $u$ positive items, negative if it contains at most $\ell$ positive items, and arbitrary (either positive or negative) otherwise. The parameter $g = u - \ell - 1$ is called the gap. In non-adaptive strategies, all tests are fixed in advance and can be represented as a measurement matrix, in which each row and each column represent a test and an item, respectively. In this paper, we consider non-adaptive threshold group testing with consecutive positives, in which the items are linearly ordered and the positives are consecutive in that order. We show that by designing deterministic and strongly explicit measurement matrices, $\lceil \log_{2}{\lceil \frac{n}{d} \rceil} \rceil + 2d + 3$ (respectively, $\lceil \log_{2}{\lceil \frac{n}{d} \rceil} \rceil + 3d$) tests suffice to identify the positives in $O\left( \log_{2}{\frac{n}{d}} + d \right)$ time when $g = 0$ (respectively, $g > 0$). The results significantly improve on the state-of-the-art scheme, which needs $15 \lceil \log_{2}{\lceil \frac{n}{d} \rceil} \rceil + 4d + 71$ tests to identify the positives in $O\left( \frac{n}{d} \log_{2}{\frac{n}{d}} + ud^{2} \right)$ time and whose associated measurement matrices are random and (non-strongly) explicit.
Title: Theoretical analysis and computation of the sample Fréchet mean of sets of large graphs for various metrics
Authors: Daniel Ferguson, F. G. Meyer
DOI: 10.1093/imaiai/iaad002
Published: 2023-03-28

Abstract: To characterize the location (mean, median) of a set of graphs, one needs a notion of centrality that has been adapted to metric spaces. A standard approach is to consider the Fréchet mean. In practice, computing the Fréchet mean for sets of large graphs presents many computational issues. In this work, we suggest a metric-independent method that may be used to compute the Fréchet mean for sets of graphs. We show that the proposed technique can be used to determine the Fréchet mean when considering the Hamming distance or a distance defined by the difference between the spectra of the adjacency matrices of the graphs.
Title: Local Viterbi property in decoding
Authors: J. Lember
DOI: 10.1093/imaiai/iaad004
Published: 2023-03-20

Abstract: The article studies the decoding problem (also known as the classification or segmentation problem) for pairwise Markov models (PMMs). A PMM is a process in which the observation process and the underlying state sequence form a two-dimensional Markov chain; it is a natural generalization of the hidden Markov model. The standard solutions to the decoding problem are the so-called Viterbi path—a sequence with maximum state path probability given the observations—and the pointwise maximum a posteriori (PMAP) path, which maximizes the expected number of correctly classified entries. When the goal is to maximize both criteria simultaneously—the conditional probability (corresponding to the Viterbi path) and the pointwise conditional probability (corresponding to the PMAP path)—they are combined into a single criterion via a regularization parameter $C$. The main objective of the article is to study the behaviour of the solution—called the hybrid path—as $C$ grows. Increasing $C$ increases the conditional probability of the hybrid path, and when $C$ is large enough, every hybrid path is a Viterbi path. We show that hybrid paths also approach the Viterbi path locally: we define $m$-locally Viterbi paths and show that the hybrid path is $m$-locally Viterbi whenever $C$ is large enough. This might all suggest that, when $C$ is relatively large, any hybrid path that is not yet a Viterbi path differs from the Viterbi path in only a few single entries. We argue that this intuition is wrong, because when unique and $m$-locally Viterbi, different hybrid paths differ by at least $m$ entries. Thus, as $C$ increases, the different hybrid paths tend to differ from each other by larger and larger intervals. Hence the hybrid paths may offer a variety of rather different solutions to the decoding problem.
Title: Minimum probability of error of list M-ary hypothesis testing
Authors: Ehsan Asadi Kangarshahi, A. Guillén i Fàbregas
DOI: 10.1093/imaiai/iaad001
Published: 2023-02-27

Abstract: We study a variation of Bayesian $M$-ary hypothesis testing in which the test outputs a list of $L$ candidates out of the $M$ possible hypotheses upon processing the observation. We study the minimum error probability of list hypothesis testing, where an error is defined as the event that the true hypothesis is not in the list output by the test. We derive two exact expressions for the minimum probability of error. The first is expressed as the error probability of a certain non-Bayesian binary hypothesis test and is reminiscent of the meta-converse bound by Polyanskiy, Poor and Verdú (2010). The second is expressed as the tail probability of the likelihood ratio between the two distributions involved in the aforementioned non-Bayesian binary hypothesis test.

Keywords: hypothesis testing, error probability, information theory.
Title: A unifying view of modal clustering
Authors: Ery Arias-Castro, Wanli Qiao
DOI: 10.1093/imaiai/iaac030
Published: 2022-08-01

Abstract: Two important non-parametric approaches to clustering emerged in the 1970s: clustering by level sets or cluster tree as proposed by Hartigan, and clustering by gradient lines or gradient flow as proposed by Fukunaga and Hostetler. In a recent paper, we draw a connection between these two approaches, in particular, by showing that the gradient flow provides a way to move along the cluster tree. Here, we argue the case that these two approaches are fundamentally the same. We do so by proposing two ways of obtaining a partition from the cluster tree—each one of them very natural in its own right—and showing that both of them reduce to the partition given by the gradient flow under standard assumptions on the sampling density.