
Electronic Journal of Statistics: Latest Articles

Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem
Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 DOI: 10.1214/23-ejs2165
Yongli Sang, Xin Dang
The categorical Gini correlation proposed by Dang et al. [7] is a dependence measure that characterizes independence between categorical and numerical variables. The asymptotic distributions of the sample correlation under dependence and under independence have been established when the dimension of the numerical variable is fixed. However, its asymptotic behavior for high-dimensional data has not been explored. In this paper, we develop the central limit theorem for the Gini correlation in the more realistic setting where the dimensionality of the numerical variable is diverging. We then construct a powerful and consistent test for the K-sample problem based on this asymptotic normality. The proposed test not only avoids the computational burden of the permutation procedure but also gains power over it. Simulation studies and real data illustrations show that the proposed test compares favorably with existing methods across a broad range of realistic situations, especially in unbalanced cases.
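As an illustration of the fixed-dimension quantity whose high-dimensional limit the paper studies, here is a minimal sketch of the sample categorical Gini correlation in the ratio form (Δ − Σ_k p_k Δ_k)/Δ, where Δ is the Gini mean (Euclidean) distance over all pairs and Δ_k the within-class version; the exact estimator in Dang et al. [7] may differ in details such as weighting.

```python
import numpy as np

def gini_correlation(X, y):
    """Sample categorical Gini correlation between numerical data X (n x p)
    and categorical labels y, using rho = (Delta - sum_k p_k Delta_k) / Delta."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    if X.shape[0] == 1:          # allow a 1-D input vector
        X = X.T
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))      # pairwise Euclidean distances
    iu = np.triu_indices(n, k=1)
    delta = D[iu].mean()                      # overall Gini mean distance
    y = np.asarray(y)
    within = 0.0
    for k in np.unique(y):
        idx = np.where(y == k)[0]
        p_k = len(idx) / n
        Dk = D[np.ix_(idx, idx)]
        iuk = np.triu_indices(len(idx), k=1)
        within += p_k * Dk[iuk].mean()        # p_k-weighted within-class distance
    return (delta - within) / delta

# Perfectly separated classes give correlation 1
print(gini_correlation([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1]))  # -> 1.0
```

When the numerical variable carries no class information, the within-class distances match the overall one and the correlation is near zero.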
Citations: 1
Pretest estimation in combining probability and non-probability samples
IF 1.1 Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 DOI: 10.1214/23-ejs2137
Chenyin Gao, Shu Yang
Multiple heterogeneous data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we develop a unified framework of the test-and-pool approach to general parameter estimation by combining gold-standard probability and non-probability samples. We focus on the case where the study variable used for estimating the target parameters is observed in both datasets, and each dataset contains other auxiliary variables. Utilizing the probability design, we conduct a pretest procedure to determine the comparability of the non-probability data with the probability data and decide whether or not to leverage the non-probability data in a pooled analysis. When the probability and non-probability data are comparable, our approach combines both for efficient estimation. Otherwise, we retain only the probability data for estimation. We also characterize the asymptotic distribution of the proposed test-and-pool estimator under a local alternative and provide a data-adaptive procedure to select the critical tuning parameters that target the smallest mean square error of the test-and-pool estimator. Lastly, to deal with the non-regularity of the test-and-pool estimator, we construct a robust confidence interval that has a good finite-sample coverage property.
Citations: 2
Selective inference for clustering with unknown variance
IF 1.1 Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 DOI: 10.1214/23-ejs2143
Y. Yun, R. Barber
In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test and to test these hypotheses; that is, both for exploratory and confirmatory data analysis. Reusing the same dataset for both exploration and testing can lead to massive selection bias and thus to many false discoveries. Selective inference is a framework that allows for performing valid inference even when the same data is reused for exploration and testing. In this work, we are interested in the problem of selective inference for data clustering, where a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a framework for selective inference in this setting, where a hierarchical clustering algorithm is used to produce the cluster assignments; this was extended to k-means clustering by Chen and Witten [2022]. Both works rely on assuming a known covariance structure for the data, but in practice the noise level needs to be estimated, and this is particularly challenging when the true cluster structure is unknown. In our work, we extend this line of work to the setting of noise with unknown variance and provide a selective inference method for this more general setting. Empirical results show that our new method is better able to maintain high power while controlling the Type I error when the true noise level is unknown.
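The selection-bias problem motivating the paper can be reproduced in a few lines: "cluster" pure-noise data (a simple threshold split stands in for a clustering algorithm), then naively run a two-sample test between the resulting groups; the test statistic is enormous even though no true group structure exists.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)              # pure noise: no true clusters

# "Cluster" by splitting at the sample mean (a data-dependent hypothesis)
g1, g2 = x[x > x.mean()], x[x <= x.mean()]

# Naive two-sample t statistic, ignoring that the split itself used the data
se = np.sqrt(g1.var(ddof=1) / len(g1) + g2.var(ddof=1) / len(g2))
t = (g1.mean() - g2.mean()) / se
print(t)   # very large, despite there being no real group difference
```

Selective inference corrects for exactly this: the null distribution must be conditioned on the event that the clustering produced these groups.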
Citations: 2
Design and analysis of bipartite experiments under a linear exposure-response model
Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 DOI: 10.1214/23-ejs2111
Christopher Harshaw, Fredrik Sävje, David Eisenstat, Vahab Mirrokni, Jean Pouget-Abadie
A bipartite experiment consists of one set of units being assigned treatments and another set of units for which we measure outcomes. The two sets of units are connected by a bipartite graph, governing how the treated units can affect the outcome units. In this paper, we consider estimation of the average total treatment effect in the bipartite experimental framework under a linear exposure-response model. We introduce the Exposure Reweighted Linear (ERL) estimator, and show that the estimator is unbiased, consistent and asymptotically normal, provided that the bipartite graph is sufficiently sparse. To facilitate inference, we introduce an unbiased and consistent estimator of the variance of the ERL point estimator. Finally, we introduce a cluster-based design, Exposure-Design, that uses heuristics to increase the precision of the ERL estimator by realizing a desirable exposure distribution.
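A schematic of the pipeline, not the paper's exact ERL weighting: define each outcome unit's exposure as the treated fraction of its bipartite neighbors, then fit the linear exposure-response model by least squares, recovering the exposure effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n_treat, n_out = 50, 200

# Bipartite graph: W[i, j] = 1 if treatment unit j can affect outcome unit i
W = (rng.random((n_out, n_treat)) < 0.05).astype(float)
W[W.sum(axis=1) == 0, 0] = 1.0            # ensure every outcome unit has a neighbor

z = rng.integers(0, 2, n_treat)           # random treatment assignment
exposure = W @ z / W.sum(axis=1)          # treated fraction of each unit's neighbors

beta_true = 2.0                           # illustrative exposure-response slope
y = 1.0 + beta_true * exposure + 0.1 * rng.normal(size=n_out)

# Least-squares fit of the linear exposure-response model y = alpha + beta * e
A = np.column_stack([np.ones_like(exposure), exposure])
alpha_hat, beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(beta_hat)   # close to 2.0: the effect of moving every exposure from 0 to 1
```

The paper's ERL estimator reweights by the realized exposures to stay unbiased under the randomization distribution; the plain regression above only illustrates the model being fit.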
Citations: 1
Semi-parametric inference for large-scale data with temporally dependent noise
Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 DOI: 10.1214/23-ejs2171
Chunming Zhang, Xiao Guo, Min Chen, Xinze Du
Temporal dependence is frequently encountered in large-scale structured noisy data, arising from scientific studies in neuroscience and meteorology, among others. This challenging characteristic may not align with existing theoretical frameworks or data analysis tools. Motivated by multi-session fMRI time series data, this paper introduces a novel semi-parametric inference procedure suitable for a broad class of “non-stationary, non-Gaussian, temporally dependent” noise processes in time-course data. It develops a new test statistic based on a tapering-type estimator of the large-dimensional noise auto-covariance matrix and establishes its asymptotic chi-squared distribution. Our method not only relaxes the consistency requirement for the noise covariance matrix estimator but also avoids direct matrix inversion without sacrificing detection power. It adapts well to both stationary and a wider range of temporal noise processes, making it particularly effective for handling challenging scenarios involving very large scales of data and large dimensions of noise covariance matrices. We demonstrate the efficacy of the proposed procedure through simulation evaluations and real fMRI data analysis.
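The tapering idea can be sketched in the univariate case: estimate lag-h autocovariances, down-weight them with a taper that vanishes beyond a bandwidth, and fill a Toeplitz matrix. The linear taper below is one illustrative choice, not necessarily the paper's.

```python
import numpy as np

def tapered_autocov_matrix(x, bandwidth):
    """Banded/tapered Toeplitz estimate of the autocovariance matrix of a
    stationary series x, with a linear taper on the lags."""
    x = np.asarray(x, float)
    n = len(x)
    x = x - x.mean()
    # lag-h sample autocovariances gamma(h) = (1/n) * sum_t x_t x_{t+h}
    gamma = np.array([x[: n - h] @ x[h:] / n for h in range(n)])
    # linear taper: weight 1 at lag 0, decaying to 0 at the bandwidth
    w = np.clip(1.0 - np.arange(n) / bandwidth, 0.0, None)
    g = gamma * w
    i = np.arange(n)
    return g[np.abs(i[:, None] - i[None, :])]   # fill the Toeplitz matrix

rng = np.random.default_rng(3)
S = tapered_autocov_matrix(rng.normal(size=500), bandwidth=10)
print(S.shape, np.allclose(S, S.T))   # (500, 500) True
```

Banding keeps the estimate well-behaved when the matrix dimension is comparable to the sample size, which is the regime the paper targets.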
Citations: 0
Online inference in high-dimensional generalized linear models with streaming data
IF 1 Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 Epub Date: 2023-11-28 DOI: 10.1214/23-ejs2182
Lan Luo, Ruijian Han, Yuanyuan Lin, Jian Huang

In this paper, we develop an online statistical inference approach for high-dimensional generalized linear models with streaming data, for real-time estimation and inference. We propose an online debiased lasso method that aligns with the data collection scheme of streaming data. Online debiased lasso differs from offline debiased lasso in two important aspects. First, it updates component-wise confidence intervals of regression coefficients using only summary statistics of the historical data. Second, online debiased lasso adds an additional term to correct approximation errors accumulated throughout the online updating procedure. We show that our proposed online debiased estimators in generalized linear models are asymptotically normal. This result provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. Extensive numerical experiments are conducted to evaluate the performance of our proposed online debiased lasso method. These experiments demonstrate the effectiveness of our algorithm and support the theoretical results. Furthermore, we illustrate the application of our method with a high-dimensional text dataset.
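The first ingredient, updating with only summary statistics of the historical data, can be sketched for a streaming linear model: accumulating the Gram matrix and cross-product batch by batch reproduces the full-data fit exactly, without revisiting raw data (the lasso penalty and the debiasing correction are omitted in this sketch).

```python
import numpy as np

rng = np.random.default_rng(4)
p = 5
beta = np.arange(1.0, p + 1.0)          # true coefficients for simulation

S = np.zeros((p, p))   # running Gram matrix   sum_t x_t x_t^T
b = np.zeros(p)        # running cross-product sum_t x_t y_t

X_all, y_all = [], []
for _ in range(10):                      # ten arriving data batches
    X = rng.normal(size=(100, p))
    y = X @ beta + 0.1 * rng.normal(size=100)
    S += X.T @ X                         # summary-statistic update only
    b += X.T @ y
    X_all.append(X); y_all.append(y)     # kept only to check against batch fit

beta_stream = np.linalg.solve(S, b)      # estimate from summaries alone

# identical to the offline fit on the pooled raw data
X_full, y_full = np.vstack(X_all), np.concatenate(y_all)
beta_batch = np.linalg.lstsq(X_full, y_full, rcond=None)[0]
print(np.allclose(beta_stream, beta_batch))   # True
```

For generalized linear models the score is nonlinear in the parameters, which is why the paper needs the extra correction term rather than exact sufficiency.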

Citations: 0
Deep learning for inverse problems with unknown operator
IF 1.1 Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 DOI: 10.1214/23-ejs2114
Miguel del Álamo
We consider ill-posed inverse problems where the forward operator $T$ is unknown; instead, we have access to training data consisting of functions $f_i$ and their noisy images $Tf_i$. This is a practically relevant and challenging problem that current methods are able to solve only under strong assumptions on the training set. Here we propose a new method that requires minimal assumptions on the data, and prove reconstruction rates that depend on the number of training points and the noise level. We show that, in the regime of "many" training data, the method is minimax optimal. The proposed method employs a type of convolutional neural network (U-nets) and empirical risk minimization in order to "fit" the unknown operator. In a nutshell, our approach is based on two ideas: the first is to relate U-nets to multiscale decompositions such as wavelets, thereby linking them to the existing theory, and the second is to use the hierarchical structure of U-nets and the low number of parameters of convolutional neural nets to prove entropy bounds that are practically useful. A significant difference from existing work on neural networks in nonparametric statistics is that we use them to approximate operators rather than functions, which we argue is mathematically more natural and technically more convenient.
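As a toy stand-in for the setup, with a linear operator class replacing the U-net: given training pairs $(f_i, Tf_i + \text{noise})$, empirical risk minimization over operators reduces to least squares, and the fitted operator generalizes to new inputs. All dimensions and noise levels below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_train = 20, 200

T_true = rng.normal(size=(d, d)) / np.sqrt(d)   # unknown forward operator

F = rng.normal(size=(n_train, d))               # training "functions" f_i as vectors
G = F @ T_true.T + 0.01 * rng.normal(size=(n_train, d))  # noisy images T f_i

# Empirical risk minimization over linear operators: min_T ||F T^T - G||_F^2
T_hat = np.linalg.lstsq(F, G, rcond=None)[0].T

f_new = rng.normal(size=d)
err = np.linalg.norm(T_hat @ f_new - T_true @ f_new)
print(err)   # small: the fitted operator generalizes to unseen inputs
```

The paper's contribution is precisely to control this kind of generalization for the far richer U-net class, via entropy bounds exploiting its multiscale structure.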
Citations: 0
Sufficient variable screening with high-dimensional controls
Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 DOI: 10.1214/23-ejs2150
Chenlu Ke
Variable screening for ultrahigh-dimensional data has attracted extensive attention in the past decade. In many applications, researchers learn from previous studies about certain important predictors or control variables related to the response of interest. Such knowledge should be taken into account in the screening procedure. The development of variable screening conditional on prior information, however, has been less fruitful compared to the vast literature on generic unconditional screening. In this paper, we propose a model-free variable screening paradigm that allows for high-dimensional controls and applies to either continuous or categorical responses. The contribution of each individual predictor is quantified marginally and conditionally, in the presence of the control variables as well as the other candidates, by reproducing-kernel-based R2 and partial R2 statistics. As a result, the proposed method enjoys the sure screening property and the rank consistency property in the notion of sufficiency, which together establish its advantage over existing methods. The advantages of the proposed method are demonstrated by simulation studies encompassing a variety of regression and classification models, and by an application to high-throughput gene expression data.
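The marginal step can be sketched with a generic kernel dependence score; a plain (biased) HSIC-style statistic stands in here for the paper's reproducing-kernel-based R², and the conditional/partial step given controls is omitted.

```python
import numpy as np

def hsic_score(x, y, sigma=1.0):
    """Biased HSIC-type estimate between 1-D x and y with Gaussian kernels;
    larger values indicate stronger dependence."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    L = np.exp(-(y[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(6)
n = 150
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = np.sin(x1) + 0.1 * rng.normal(size=n)      # depends on x1 only, nonlinearly

scores = [hsic_score(x1, y), hsic_score(x2, y)]
print(int(np.argmax(scores)))   # 0: the active predictor is ranked first
```

Because the score is kernel-based and model-free, it picks up the nonlinear signal in x1 that a Pearson-correlation screen could miss.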
Citations: 0
Structure learning via unstructured kernel-based M-estimation
Mathematics (Tier 4) Q3 STATISTICS & PROBABILITY Pub Date: 2023-01-01 DOI: 10.1214/23-ejs2153
Xin He, Yeheng Ge, Xingdong Feng
In statistical learning, identifying the underlying structures of true target functions from observed data plays a crucial role in facilitating subsequent modeling and analysis. Unlike most existing methods, which focus on specific settings under particular model assumptions, this paper proposes a general and novel framework for recovering the true structures of target functions by using unstructured M-estimation in a reproducing kernel Hilbert space (RKHS). The framework is inspired by the fact that gradient functions can be employed as a valid tool for learning underlying structures, including sparse learning, interaction selection, and model identification, and it is easy to implement by taking advantage of several convenient properties of the RKHS. More importantly, it admits a wide range of loss functions and thus includes many commonly used methods as special cases, such as mean regression, quantile regression, likelihood-based classification, and margin-based classification, while remaining computationally efficient since only convex optimization tasks need to be solved. The asymptotic results of the proposed framework are established for a rich family of loss functions without any explicit model specifications. The superior performance of the proposed framework is also demonstrated by a variety of simulated examples and a real case study.
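The gradient idea can be sketched as follows (an illustrative construction, not the paper's estimator): fit a kernel ridge regression, differentiate the fitted function analytically through the Gaussian kernel, and compare average gradient norms across coordinates; inactive variables show near-zero gradient signal.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma, lam = 100, 3, 1.0, 1e-2

X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=n)   # only coordinate 0 is active

# Kernel ridge regression with a Gaussian kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / (2 * sigma ** 2))
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Gradient of the fitted function at each training point x:
# d f_hat(x)/dx_j = sum_i alpha_i * K(x_i, x) * (x_ij - x_j) / sigma^2
diff = (X[:, None, :] - X[None, :, :]) / sigma ** 2   # indexed (i, x, j)
grads = np.einsum('i,ixj->xj', alpha, K[:, :, None] * diff)
avg_norm = np.sqrt((grads ** 2).mean(axis=0))         # per-coordinate gradient signal
print(int(np.argmax(avg_norm)))   # index of the dominant coordinate; should be 0
```

The paper develops this intuition for general convex losses, estimating gradient functions directly by unstructured M-estimation rather than differentiating a fitted regression.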
{"title":"Structure learning via unstructured kernel-based M-estimation","authors":"Xin He, Yeheng Ge, Xingdong Feng","doi":"10.1214/23-ejs2153","DOIUrl":"https://doi.org/10.1214/23-ejs2153","url":null,"abstract":"In statistical learning, identifying underlying structures of true target functions based on observed data plays a crucial role to facilitate subsequent modeling and analysis. Unlike most of those existing methods that focus on some specific settings under certain model assumptions, a general and novel framework is proposed for recovering the true structures of target functions by using unstructured M-estimation in a reproducing kernel Hilbert space (RKHS) in this paper. This framework is inspired by the fact that gradient functions can be employed as a valid tool to learn underlying structures, including sparse learning, interaction selection and model identification, and it is easy to implement by taking advantage of some nice properties of the RKHS. More importantly, it admits a wide range of loss functions, and thus includes many commonly used methods as special cases, such as mean regression, quantile regression, likelihood-based classification, and margin-based classification, which is also computationally efficient by solving convex optimization tasks. The asymptotic results of the proposed framework are established within a rich family of loss functions without any explicit model specifications. 
The superior performance of the proposed framework is also demonstrated by a variety of simulated examples and a real case study.","PeriodicalId":49272,"journal":{"name":"Electronic Journal of Statistics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135958438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
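The gradient idea behind this kind of structure learning can be illustrated with a small sketch. This is not the authors' estimator: it fits plain kernel ridge regression with a Gaussian kernel and ranks input coordinates by the average magnitude of the fitted function's partial derivatives, which are available in closed form in the RKHS. The bandwidth `gamma`, ridge level `lam`, and the toy model are illustrative choices.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    # Gaussian kernel matrix K[i, j] = exp(-gamma * ||X_i - Z_j||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def gradient_relevance(X, y, gamma=0.2, lam=1e-3):
    """Fit kernel ridge regression in an RKHS and rank input coordinates
    by the average magnitude of the fitted function's partial derivatives
    (large scores suggest active variables)."""
    n, p = X.shape
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    # For the Gaussian kernel, d/dx_j K(x, x_k) = -2*gamma*(x_j - x_kj)*K(x, x_k),
    # so the gradient of the fitted function has a closed form.
    diff = X[:, None, :] - X[None, :, :]                          # (n, n, p)
    grads = (-2 * gamma * diff * K[:, :, None] * alpha[None, :, None]).sum(axis=1)
    return np.abs(grads).mean(axis=0)                             # (p,)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 2 * X[:, 1] + 0.1 * rng.normal(size=200)   # only x0, x1 active
scores = gradient_relevance(X, y)
print(scores)  # coordinate 1 (the linear term) should score highest
```

In this toy example the two active coordinates receive the largest gradient scores, mirroring how gradient magnitudes can drive sparse learning and interaction selection in the framework described above.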
Citations: 0
Estimation of the Hurst parameter from continuous noisy data
Mathematics (CAS Zone 4); Q3, Statistics & Probability. Pub Date: 2023-01-01. DOI: 10.1214/23-ejs2156
Pavel Chigansky, Marina Kleptsyna
This paper addresses the problem of estimating the Hurst exponent of fractional Brownian motion from a continuous-time noisy sample. When the Hurst parameter is greater than 3∕4, consistent estimation is possible only if either the length of the observation interval increases to infinity or the intensity of the noise decreases to zero. The main result is a proof of the Local Asymptotic Normality (LAN) of the model in these two regimes, which reveals the optimal minimax estimation rates.
Citations: 1
Journal: Electronic Journal of Statistics