
Latest publications in Information and Inference: A Journal of the IMA

Optimal variable clustering for high-dimensional matrix valued data.
IF 1.4 Region 4 (Mathematics) Q2 MATHEMATICS, APPLIED Pub Date: 2025-03-12 eCollection Date: 2025-03-01 DOI: 10.1093/imaiai/iaaf001
Inbeom Lee, Siyi Deng, Yang Ning

Matrix valued data has become increasingly prevalent in many applications. Most of the existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be very informative, especially in high-dimensional settings or when mean information is not available. To extract the information from the dependence structure for clustering, we propose a new latent variable model for the features arranged in matrix form, with some unknown membership matrices representing the clusters for the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms using the difference of a weighted covariance matrix as the dissimilarity measure. Theoretically, we show that under mild conditions, our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for our algorithm with a broad class of weighted covariance matrices, the conditions for this result depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish the minimax lower bound for clustering under our latent variable model in terms of some cluster separation metric. Given these results, we identify the optimal weight in the sense that using this weight guarantees our algorithm to be minimax rate-optimal. The practical implementation of our algorithm with the optimal weight is also discussed. Simulation studies show that our algorithm performs better than existing methods in terms of the adjusted Rand index (ARI). The method is applied to a genomic dataset and yields meaningful interpretations.
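The clustering idea in the abstract, using differences of covariances as a dissimilarity between features, can be sketched in a few lines. This is not the authors' weighted-covariance algorithm, just a minimal numpy illustration on a synthetic two-factor design; the dissimilarity, threshold and linkage choice here are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: features 0-3 share latent factor A, features 4-7 share factor B,
# so the cluster structure lives in the covariance, not in the (zero) mean.
n, p = 2000, 8
Z = rng.standard_normal((n, 2))
loadings = np.zeros((p, 2))
loadings[:4, 0] = 1.0
loadings[4:, 1] = 1.0
X = Z @ loadings.T + 0.3 * rng.standard_normal((n, p))

S = np.cov(X, rowvar=False)

# Covariance-difference dissimilarity: two features in the same cluster have
# near-identical covariances with every *other* feature.
def dissim(i, j):
    mask = np.ones(p, dtype=bool)
    mask[[i, j]] = False
    return np.max(np.abs(S[i, mask] - S[j, mask]))

# Single-linkage grouping at a fixed threshold, via union-find.
parent = list(range(p))
def find(a):
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

for i in range(p):
    for j in range(i + 1, p):
        if dissim(i, j) < 0.5:
            parent[find(i)] = find(j)

labels = [find(i) for i in range(p)]
```

With this setup the dissimilarity between same-cluster features is sampling noise (order n^{-1/2}), while cross-cluster pairs differ by roughly 1, so the threshold cleanly separates the two groups.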

Citations: 0
The Dyson equalizer: adaptive noise stabilization for low-rank signal detection and recovery.
IF 1.6 Region 4 (Mathematics) Q2 MATHEMATICS, APPLIED Pub Date: 2025-01-16 eCollection Date: 2025-03-01 DOI: 10.1093/imaiai/iaae036
Boris Landa, Yuval Kluger

Detecting and recovering a low-rank signal in a noisy data matrix is a fundamental task in data analysis. Typically, this task is addressed by inspecting and manipulating the spectrum of the observed data, e.g. thresholding the singular values of the data matrix at a certain critical level. This approach is well established in the case of homoskedastic noise, where the noise variance is identical across the entries. However, in numerous applications, the noise can be heteroskedastic, where the noise characteristics may vary considerably across the rows and columns of the data. In this scenario, the spectral behaviour of the noise can differ significantly from the homoskedastic case, posing various challenges for signal detection and recovery. To address these challenges, we develop an adaptive normalization procedure that equalizes the average noise variance across the rows and columns of a given data matrix. Our proposed procedure is data-driven and fully automatic, supporting a broad range of noise distributions, variance patterns and signal structures. Our approach relies on random matrix theory results that describe the resolvent of the noise via the so-called Dyson equation. By leveraging this relation, we can accurately infer the noise level in each row and each column directly from the resolvent of the data. We establish that in many cases, our normalization enforces the standard spectral behaviour of homoskedastic noise-the Marchenko-Pastur (MP) law, allowing for simple and reliable detection of signal components. Furthermore, we demonstrate that our approach can substantially improve signal recovery in heteroskedastic settings by manipulating the spectrum after normalization. Lastly, we apply our method to single-cell RNA sequencing and spatial transcriptomics data, showcasing accurate fits to the MP law after normalization.
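The goal described above, rescaling rows and columns so the average noise variance is one everywhere and the spectrum then follows the Marchenko-Pastur (MP) law, can be illustrated with a crude stand-in. This is not the paper's Dyson-equation-based estimator; it is a simple alternating empirical-variance scaling applied to pure noise with a rank-one variance profile, after which the top singular value should sit near the MP edge:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 400, 1000

# Heteroskedastic noise: entry (i, j) has variance r_i * c_j.
r = rng.uniform(0.5, 2.0, size=m)
c = rng.uniform(0.5, 2.0, size=n)
Y = np.sqrt(np.outer(r, c)) * rng.standard_normal((m, n))

# Alternating row/column scaling until average variances are ~1
# (a naive stand-in for the paper's Dyson-equation estimator).
A = Y.copy()
for _ in range(20):
    A = A / np.sqrt((A ** 2).mean(axis=1, keepdims=True))
    A = A / np.sqrt((A ** 2).mean(axis=0, keepdims=True))

# For unit-variance i.i.d. noise, the largest singular value of A / sqrt(n)
# concentrates at the Marchenko-Pastur edge 1 + sqrt(m / n).
top = np.linalg.svd(A / np.sqrt(n), compute_uv=False)[0]
mp_edge = 1 + np.sqrt(m / n)
```

After normalization the top singular value lands within sampling fluctuation of the MP edge, which is the behaviour the abstract's normalization is designed to enforce in far greater generality.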

Citations: 0
Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise.
IF 1.6 Region 4 (Mathematics) Q2 MATHEMATICS, APPLIED Pub Date: 2024-09-20 eCollection Date: 2024-12-01 DOI: 10.1093/imaiai/iaae026
Xiuyuan Cheng, Boris Landa

Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates, when [Formula: see text] data points are i.i.d. sampled from a general [Formula: see text]-dimensional manifold embedded in a possibly high-dimensional space. Under certain joint limit of [Formula: see text] and kernel bandwidth [Formula: see text], the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be [Formula: see text] at finite large [Formula: see text] up to log factors, achieved at the scaling of [Formula: see text]. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.
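A minimal sketch of the Sinkhorn-Knopp (SK) step the abstract builds on: bi-stochastically normalizing a Gaussian kernel affinity matrix, i.e. finding a positive vector d with diag(d) W diag(d) doubly stochastic. The damped symmetric iteration below is one common variant; the paper's approximate, constrained scaling with early termination is not implemented here:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.standard_normal((n, 3))

# Gaussian kernel affinity matrix on the point cloud.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)

# Symmetric Sinkhorn-Knopp: at the fixed point, d * (W @ d) == 1, i.e. every
# row (and by symmetry, column) sum of diag(d) W diag(d) equals 1.
# The square-root damping stabilizes the plain d <- 1 / (W @ d) iteration.
d = np.ones(n)
for _ in range(5000):
    d_new = np.sqrt(d / (W @ d))
    if np.max(np.abs(d_new - d)) < 1e-13:
        d = d_new
        break
    d = d_new

P = W * d[:, None] * d[None, :]  # bi-stochastically normalized affinity
```

The normalized matrix P is symmetric with unit row and column sums; the graph Laplacian in the paper's setting is then built from P instead of the usual degree-normalized kernel.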

Citations: 0
Phase transition and higher order analysis of Lq regularization under dependence.
IF 1.4 Region 4 (Mathematics) Q2 MATHEMATICS, APPLIED Pub Date: 2024-02-20 eCollection Date: 2024-03-01 DOI: 10.1093/imaiai/iaae005
Hanwen Huang, Peng Zeng, Qinglong Yang

We study the problem of estimating a [Formula: see text]-sparse signal [Formula: see text] from a set of noisy observations [Formula: see text] under the model [Formula: see text], where [Formula: see text] is the measurement matrix the row of which is drawn from distribution [Formula: see text]. We consider the class of [Formula: see text]-regularized least squares (LQLS) given by the formulation [Formula: see text], where [Formula: see text]  [Formula: see text] denotes the [Formula: see text]-norm. In the setting [Formula: see text] with fixed [Formula: see text] and [Formula: see text], we derive the asymptotic risk of [Formula: see text] for arbitrary covariance matrix [Formula: see text] that generalizes the existing results for standard Gaussian design, i.e. [Formula: see text]. The results were derived from the non-rigorous replica method. We perform a higher-order analysis for LQLS in the small-error regime in which the first dominant term can be used to determine the phase transition behavior of LQLS. Our results show that the first dominant term does not depend on the covariance structure of [Formula: see text] in the cases [Formula: see text] and [Formula: see text] which indicates that the correlations among predictors only affect the phase transition curve in the case [Formula: see text] a.k.a. LASSO. To study the influence of the covariance structure of [Formula: see text] on the performance of LQLS in the cases [Formula: see text] and [Formula: see text], we derive the explicit formulas for the second dominant term in the expansion of the asymptotic risk in terms of small error. Extensive computational experiments confirm that our analytical predictions are consistent with numerical results.
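For concreteness, here is the q = 1 member of the LQLS family (the LASSO) solved by plain iterative soft thresholding (ISTA) on synthetic sparse-regression data. The paper's replica and phase-transition analysis is not reproduced; all problem sizes and tuning values below are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 200, 50, 5

# k-sparse signal observed through a Gaussian design with additive noise.
A = rng.standard_normal((n, p)) / np.sqrt(n)   # columns have ~unit norm
x_true = np.zeros(p)
x_true[:k] = 3.0
y = A @ x_true + 0.05 * rng.standard_normal(n)

# ISTA for the q = 1 LQLS objective  (1/2)||y - A x||^2 + lam ||x||_1:
#   x <- soft_threshold(x - t * A^T (A x - y), t * lam)
lam = 0.1
t = 1.0 / np.linalg.norm(A, 2) ** 2            # step size 1 / ||A||_2^2
x = np.zeros(p)
for _ in range(3000):
    g = x - t * (A.T @ (A @ x - y))
    x = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)

support = np.flatnonzero(np.abs(x) > 0.5)
```

In this well-conditioned regime the recovered support matches the true one exactly, with the familiar lam-sized shrinkage bias on the nonzero coefficients.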

Citations: 0
On statistical inference with high-dimensional sparse CCA.
IF 1.4 Region 4 (Mathematics) Q2 MATHEMATICS, APPLIED Pub Date: 2023-11-17 eCollection Date: 2023-12-01 DOI: 10.1093/imaiai/iaad040
Nilanjana Laha, Nathan Huey, Brent Coull, Rajarshi Mukherjee

We consider asymptotically exact inference on the leading canonical correlation directions and strengths between two high-dimensional vectors under sparsity restrictions. In this regard, our main contribution is developing a novel representation of the Canonical Correlation Analysis problem, based on which one can operationalize a one-step bias correction on reasonable initial estimators. Our analytic results in this regard are adaptive over suitable structural restrictions of the high-dimensional nuisance parameters, which, in this set-up, correspond to the covariance matrices of the variables of interest. We further supplement the theoretical guarantees behind our procedures with extensive numerical studies.
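As background, classical (low-dimensional, unregularized) CCA computes the canonical correlations as singular values of the whitened cross-covariance matrix. The sketch below illustrates only that baseline on a shared-latent-factor example, not the paper's sparse high-dimensional estimator or its one-step bias correction:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# One shared latent factor: the leading canonical correlation between
# X and Y is corr(u, u + e) = 1/sqrt(2) ~ 0.707; the second is ~0.
u = rng.standard_normal(n)
X = np.column_stack([u, rng.standard_normal(n)])
Y = np.column_stack([u + rng.standard_normal(n), rng.standard_normal(n)])

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sxx, Syy = Xc.T @ Xc / n, Yc.T @ Yc / n
Sxy = Xc.T @ Yc / n

def inv_sqrt(S):
    # Symmetric inverse square root via the eigendecomposition.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

# Canonical correlations = singular values of Sxx^{-1/2} Sxy Syy^{-1/2}.
rho = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy), compute_uv=False)
```

In the high-dimensional sparse regime studied in the paper, Sxx and Syy cannot be inverted reliably, which is precisely why the novel representation and bias-corrected estimators are needed.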

Citations: 0
Black-box tests for algorithmic stability.
IF 1.4 Region 4 (Mathematics) Q2 MATHEMATICS, APPLIED Pub Date: 2023-10-14 eCollection Date: 2023-12-01 DOI: 10.1093/imaiai/iaad039
Byol Kim, Rina Foygel Barber

Algorithmic stability is a concept from learning theory that expresses the degree to which changes to the input data (e.g. removal of a single data point) may affect the outputs of a regression algorithm. Knowing an algorithm's stability properties is often useful for many downstream applications; for example, stability is known to lead to desirable generalization properties and predictive inference guarantees. However, many modern algorithms currently used in practice are too complex for a theoretical analysis of their stability properties, and thus we can only attempt to establish these properties through an empirical exploration of the algorithm's behaviour on various datasets. In this work, we lay out a formal statistical framework for this kind of black-box testing without any assumptions on the algorithm or the data distribution, and establish fundamental bounds on the ability of any black-box test to identify algorithmic stability.
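An empirical stability probe of the kind described above can be as simple as refitting with one point deleted and recording the worst change in the output. The sketch below treats the sample mean as the black-box "algorithm", since its leave-one-out sensitivity is known in closed form, which makes the probe easy to check; the paper's formal statistical testing framework is not implemented:

```python
import numpy as np

rng = np.random.default_rng(5)

# Treat the algorithm as a black box: fit on a dataset, return its output.
def algorithm(data):
    return data.mean()  # the sample mean is a (1/n)-stable algorithm

n = 500
data = rng.standard_normal(n)

# Empirical leave-one-out probe: how far can deleting a single point move
# the output? For the mean, the change is exactly (mu - x_i) / (n - 1).
full = algorithm(data)
changes = np.array([abs(algorithm(np.delete(data, i)) - full) for i in range(n)])
worst = changes.max()
```

For a genuinely opaque algorithm one would repeat this probe over many datasets and perturbations; the paper's point is that any such black-box procedure faces fundamental limits in certifying stability.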

Citations: 0
Bayesian denoising of structured sources and its implications on learning-based denoising
Region 4 (Mathematics) Q2 MATHEMATICS, APPLIED Pub Date: 2023-09-19 DOI: 10.1093/imaiai/iaad036
Wenda Zhou, Joachim Wabnig, Shirin Jalali
Abstract Denoising a stationary process $(X_{i})_{i \in \mathbb{Z}}$ corrupted by additive white Gaussian noise $(Z_{i})_{i \in \mathbb{Z}}$ is a classic, well-studied and fundamental problem in information theory and statistical signal processing. However, finding theoretically founded computationally efficient denoising methods applicable to general sources is still an open problem. In the Bayesian set-up where the source distribution is known, a minimum mean square error (MMSE) denoiser estimates $X^{n}$ from noisy measurements $Y^{n}$ as $\hat{X}^{n}=\mathrm{E}[X^{n}|Y^{n}]$. However, for general sources, computing $\mathrm{E}[X^{n}|Y^{n}]$ is computationally very challenging, if not infeasible. In this paper, starting from a Bayesian set-up, a novel denoising method, namely, quantized maximum a posteriori (Q-MAP) denoiser is proposed and its asymptotic performance is analysed. Both for memoryless sources, and for structured first-order Markov sources, it is shown that, asymptotically, as $\sigma_{z}^{2}$ (noise variance) converges to zero, $\frac{1}{\sigma_{z}^{2}}\mathrm{E}[(X_{i}-\hat{X}^{\mathrm{QMAP}}_{i})^{2}]$ converges to the information dimension of the source. For the studied memoryless sources, this limit is known to be optimal. A key advantage of the Q-MAP denoiser, unlike an MMSE denoiser, is that it highlights the key properties of the source distribution that are to be used in its denoising. This key property leads to a new learning-based denoising approach that is applicable to generic structured sources. Using ImageNet database for training, initial simulation results exploring the performance of such a learning-based denoiser in image denoising are presented.
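The MMSE baseline $\hat{X}^{n}=\mathrm{E}[X^{n}|Y^{n}]$ that the abstract contrasts with is computable in closed form for a toy memoryless two-point source (a discrete source, so its information dimension is zero). The sketch below shows only that baseline, not the Q-MAP denoiser:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, n = 0.3, 200_000
p1 = 0.5

# Memoryless two-point source on {0, 1}, corrupted by Gaussian noise.
x = (rng.random(n) < p1).astype(float)
y = x + sigma * rng.standard_normal(n)

# Exact Bayesian MMSE denoiser E[X | Y]: the posterior probability of the
# point 1, obtained from the two Gaussian likelihoods (the source is scalar
# and memoryless, so the denoiser factorizes over coordinates).
w1 = p1 * np.exp(-((y - 1.0) ** 2) / (2 * sigma ** 2))
w0 = (1 - p1) * np.exp(-(y ** 2) / (2 * sigma ** 2))
x_hat = w1 / (w0 + w1)

mse = np.mean((x - x_hat) ** 2)
naive = np.mean((x - y) ** 2)  # identity "denoiser" has MSE ~ sigma^2
```

Even this two-point example shows why generic sources are hard: the closed form above exists only because the prior is a known two-atom distribution, whereas for structured high-dimensional sources the conditional expectation has no such expression, which motivates both Q-MAP and the learning-based approach.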
Citations: 0
Near-optimal estimation of linear functionals with log-concave observation errors
Region 4 (Mathematics) Q2 MATHEMATICS, APPLIED Pub Date: 2023-09-19 DOI: 10.1093/imaiai/iaad038
Simon Foucart, Grigoris Paouris
Abstract This note addresses the question of optimally estimating a linear functional of an object acquired through linear observations corrupted by random noise, where optimality pertains to a worst-case setting tied to a symmetric, convex and closed model set containing the object. It complements the article 'Statistical Estimation and Optimal Recovery' published in the Annals of Statistics in 1994. There, Donoho showed (among other things) that, for Gaussian noise, linear maps provide near-optimal estimation schemes relative to a performance measure relevant in Statistical Estimation. Here, we advocate for a different performance measure arguably more relevant in Optimal Recovery. We show that, relative to this new measure, linear maps still provide near-optimal estimation schemes even if the noise is merely log-concave. Our arguments, which make a connection to the deterministic noise situation and bypass properties specific to the Gaussian case, offer an alternative to parts of Donoho's proof.
Citations: 0
Graph-based approximate message passing iterations
Mathematics (CAS Tier 4), Q2 MATHEMATICS, APPLIED. Pub Date: 2023-09-18. DOI: 10.1093/imaiai/iaad020
Cédric Gerbelot, Raphaël Berthier
Abstract: Approximate message passing (AMP) algorithms have become an important element of high-dimensional statistical inference, largely due to their adaptability and their concentration properties, captured by the state evolution (SE) equations. This is demonstrated by the growing number of new iterations proposed for increasingly complex problems, ranging from multi-layer inference to low-rank matrix estimation with elaborate priors. In this paper, we address the following questions: is there a structure underlying all AMP iterations that unifies them in a common framework? Can we use such a structure to give a modular proof of state evolution equations, adaptable to new AMP iterations without reproducing the full argument each time? We propose an answer to both questions, showing that AMP instances can be generically indexed by an oriented graph. This enables a unified interpretation of these iterations, independent of the problem they solve, and a way of composing them arbitrarily. We then show that all AMP iterations indexed by such a graph verify rigorous SE equations, extending the reach of previous proofs and proving a number of recent heuristic derivations of those equations. Our proof naturally includes non-separable functions, and we show how existing refinements, such as spatial coupling or matrix-valued variables, can be combined with our framework.
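The abstract treats AMP iterations abstractly; for concreteness, here is a minimal sketch of the classical soft-thresholding AMP for sparse linear regression, the simplest instance such frameworks cover. The threshold schedule (a multiple of the residual RMS) and the parameter values are common heuristics, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Coordinatewise soft-thresholding denoiser eta(v; t)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def amp_sparse(y, A, n_iter=50, theta=1.3):
    """Minimal soft-thresholding AMP for y = A x + noise, with A an m-by-n
    matrix having roughly unit-norm i.i.d. columns.  The Onsager correction
    (z/m) * ||x||_0 is what distinguishes AMP from plain iterative
    thresholding; it keeps the effective noise Gaussian, which is what the
    state evolution equations track in the large-system limit."""
    m, n = A.shape
    x = np.zeros(n)
    z = y.copy()
    for _ in range(n_iter):
        pseudo = x + A.T @ z                    # effective observation x + noise
        tau = theta * np.sqrt(np.mean(z ** 2))  # threshold ~ effective noise level
        x_new = soft_threshold(pseudo, tau)
        z = y - A @ x_new + (z / m) * np.count_nonzero(x_new)  # Onsager term
        x = x_new
    return x
```

On a noiseless instance well inside the recoverable regime (e.g. 10 nonzeros, 250 measurements, 500 unknowns), the iteration recovers the signal to high accuracy.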
Citations: 1
Spectral deconvolution of matrix models: the additive case
Mathematics (CAS Tier 4), Q2 MATHEMATICS, APPLIED. Pub Date: 2023-09-18. DOI: 10.1093/imaiai/iaad037
Pierre Tarrago
Abstract: We implement a complex-analytic method to build an estimator of the spectrum of a matrix perturbed by the addition of a random matrix noise in the free probabilistic regime. This method, previously introduced by Arizmendi, Tarrago and Vargas, involves two steps: the first step is a fixed-point method to compute the Stieltjes transform of the desired distribution in a certain domain, and the second step is a classical deconvolution by a Cauchy distribution, whose parameter depends on the intensity of the noise. This method thus reduces the spectral deconvolution problem to a classical one. We provide explicit bounds for the mean squared error of the first step under the assumption that the distribution of the noise is unitarily invariant. In the case where the unknown measure is sparse or close to a distribution with a sufficiently smooth density, we prove that the resulting estimator converges to the measure in the $1$-Wasserstein distance at speed $O(1/\sqrt{N})$, where $N$ is the dimension of the matrix.
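The fixed-point computation of a Stieltjes transform mentioned in the first step has a well-known special case: the Pastur equation for the free additive convolution of a measure $\mu$ with a semicircle distribution of variance $\sigma^2$. The sketch below illustrates that flavor of iteration only, not the paper's exact scheme; the convention used is $g_\mu(z) = \int (x - z)^{-1}\,d\mu(x)$, so $g$ maps the upper half-plane to itself.

```python
import numpy as np

def stieltjes(eigs, z):
    """Empirical Stieltjes transform g(z) = (1/N) * sum_i 1/(lambda_i - z)."""
    return np.mean(1.0 / (eigs - z))

def stieltjes_free_add_semicircle(eigs, z, sigma2, n_iter=300):
    """Pastur fixed-point equation for mu (boxplus) semicircle(sigma^2):
        g(z) = g_mu(z + sigma2 * g(z)),
    iterated from the empirical transform.  For z in the upper half-plane
    and moderate sigma2 the map is a contraction, so plain iteration
    converges."""
    g = stieltjes(eigs, z)
    for _ in range(n_iter):
        g = stieltjes(eigs, z + sigma2 * g)
    return g
```

Sanity check: for $\mu = \delta_0$ the free additive convolution is the semicircle law itself; with $\sigma^2 = 1/2$ its Stieltjes transform at $z = i$ equals $i(\sqrt{3} - 1)$, and the iteration reproduces this value.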
Citations: 0