
Statistica Sinica: Latest Articles

A Data Fusion Method for Quantile Treatment Effects
IF 1.4 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-07-16 | DOI: 10.5705/ss.202022.0288
Yijiao Zhang, Zhongyi Zhu
With the increasing availability of datasets, developing data fusion methods that leverage the strengths of different datasets to estimate causal effects is of great practical importance to many scientific fields. In this paper, we consider estimating the quantile treatment effects using small validation data with fully-observed confounders and large auxiliary data with unmeasured confounders. We propose a Fused Quantile Treatment effects Estimator (FQTE) by integrating the information from two datasets based on doubly robust estimating functions. We allow for the misspecification of the models on the dataset with unmeasured confounders. Under mild conditions, we show that the proposed FQTE is asymptotically normal and more efficient than the initial QTE estimator that uses the validation data alone. By establishing the asymptotic linear forms of related estimators, convenient methods for covariance estimation are provided. Simulation studies demonstrate the empirical validity and improved efficiency of our fused estimators. We illustrate the proposed method with an application.
Citations: 0
PARTIALLY FUNCTIONAL LINEAR QUANTILE REGRESSION WITH MEASUREMENT ERRORS.
IF 1.5 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-07-01 | DOI: 10.5705/ss.202021.0246
Mengli Zhang, Lan Xue, Carmen D Tekwe, Yang Bai, Annie Qu

Ignoring measurement errors in conventional regression analyses can lead to biased estimation and inference results. Reducing such bias is challenging when the error-prone covariate is a functional curve. In this paper, we propose a new corrected loss function for a partially functional linear quantile model with function-valued measurement errors. We establish the asymptotic properties of both the functional coefficient and the parametric coefficient estimators. We also demonstrate the finite-sample performance of the proposed method using simulation studies, and illustrate its advantages by applying it to data from a childhood obesity study.

Pages 2257-2280. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11346807/pdf/
Citations: 0
A Comparison of Estimators of Mean and Its Functions in Finite Populations
IF 1.4 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-05-24 | DOI: 10.5705/ss.202022.0181
Anurag Dey, P. Chaudhuri
Several well-known estimators of the finite population mean and its functions are investigated under some standard sampling designs. Such functions of the mean include the variance, the correlation coefficient and the regression coefficient in the population as special cases. We compare the performance of these estimators under different sampling designs based on their asymptotic distributions. Equivalence classes of estimators under different sampling designs are constructed so that estimators in the same class have equivalent performance in terms of asymptotic mean squared errors (MSEs). Estimators in different equivalence classes are then compared under some superpopulations satisfying linear models. It is shown that the pseudo empirical likelihood (PEML) estimator of the population mean under simple random sampling without replacement (SRSWOR) has the lowest asymptotic MSE among all the estimators under different sampling designs considered in this paper. It is also shown that for the variance, the correlation coefficient and the regression coefficient of the population, the plug-in estimators based on the PEML estimator have the lowest asymptotic MSEs among all the estimators considered in this paper under SRSWOR. On the other hand, for any high entropy $\pi$PS (HE$\pi$PS) sampling design, which uses the auxiliary information, the plug-in estimators of those parameters based on the Hájek estimator have the lowest asymptotic MSEs among all the estimators considered in this paper.
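For readers unfamiliar with the Hájek estimator named in this abstract, the sketch below contrasts it with the Horvitz-Thompson estimator of a finite population mean. It is an illustration only and is not taken from the paper: the function names and the toy inclusion probabilities are assumptions, and the PEML estimator is not covered.

```python
import numpy as np

def horvitz_thompson_mean(y, pi, N):
    """Horvitz-Thompson estimator: inverse-probability-weighted total
    divided by the known population size N."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(y / pi) / N)

def hajek_mean(y, pi):
    """Hajek estimator: the same weighted total, normalized by the
    estimated population size sum(1 / pi) instead of the known N."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)
    return float(np.sum(w * y) / np.sum(w))

# Hypothetical sample: three units, each with inclusion probability 0.5,
# drawn from a population of size N = 6.
y_sample = [1.0, 2.0, 3.0]
pi_sample = [0.5, 0.5, 0.5]
ht = horvitz_thompson_mean(y_sample, pi_sample, N=6)
hajek = hajek_mean(y_sample, pi_sample)
```

With equal inclusion probabilities both estimators reduce to the sample mean. They differ when the probabilities vary across units, which is the $\pi$PS setting the abstract refers to.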
Citations: 0
An Efficient Greedy Search Algorithm for High-dimensional Linear Discriminant Analysis.
IF 1.4 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-05-01 | DOI: 10.5705/ss.202021.0028
Hannan Yang, D Y Lin, Quefeng Li

High-dimensional classification is an important statistical problem that has applications in many areas. One widely used classifier is linear discriminant analysis (LDA). In recent years, many regularized LDA classifiers have been proposed to solve the problem of high-dimensional classification. However, these methods rely on inverting a large matrix or solving large-scale optimization problems to render classification rules, methods that are computationally prohibitive when the dimension is ultra-high. With the emergence of big data, it is increasingly important to develop more efficient algorithms to solve the high-dimensional LDA problem. In this paper, we propose an efficient greedy search algorithm that depends solely on closed-form formulae to learn a high-dimensional LDA rule. We establish theoretical guarantees of its statistical properties in terms of variable selection and error rate consistency; in addition, we provide an explicit interpretation of the extra information brought by an additional feature in an LDA problem under some mild distributional assumptions. We demonstrate that this new algorithm drastically improves computational speed compared with other high-dimensional LDA methods, while maintaining comparable or even better classification performance.

Volume 33 SI, pages 1343-1364. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10348717/pdf/nihms-1764480.pdf
Citations: 1
Marginal Bayesian Posterior Inference using Recurrent Neural Networks with Application to Sequential Models.
IF 1.4 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-05-01 | DOI: 10.5705/ss.202020.0348
Thayer Fisher, Alex Luedtke, Marco Carone, Noah Simon

In Bayesian data analysis, it is often important to evaluate quantiles of the posterior distribution of a parameter of interest (e.g., to form posterior intervals). In multi-dimensional problems, when non-conjugate priors are used, this is often difficult, generally requiring either an analytic or sampling-based approximation, such as Markov chain Monte Carlo (MCMC), approximate Bayesian computation (ABC) or variational inference. We discuss a general approach that reframes this as a multi-task learning problem and uses recurrent deep neural networks (RNNs) to approximately evaluate posterior quantiles. As RNNs carry information along a sequence, this application is particularly useful in time series. An advantage of this risk-minimization approach is that we do not need to sample from the posterior or calculate the likelihood. We illustrate the proposed approach in several examples.
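The abstract does not say which loss the networks minimize. As background on how a quantile can be obtained by risk minimization at all, the sketch below uses the standard check (pinball) loss, whose minimizer over a constant is the corresponding quantile. This is an assumption chosen for illustration, not the paper's stated method, and the function names and the grid search are hypothetical.

```python
import numpy as np

def pinball_loss(q, samples, tau):
    """Average check (pinball) loss of the candidate value q at level tau."""
    diff = samples - q
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

def quantile_by_risk_minimization(samples, tau, grid):
    """Return the grid point that minimizes the empirical pinball loss."""
    losses = [pinball_loss(q, samples, tau) for q in grid]
    return float(grid[int(np.argmin(losses))])

rng = np.random.default_rng(0)
draws = rng.normal(size=5000)            # stand-in for draws of a parameter
grid = np.linspace(-3.0, 3.0, 601)       # candidate quantile values
q90 = quantile_by_risk_minimization(draws, 0.9, grid)
# q90 should lie close to np.quantile(draws, 0.9), up to the grid resolution.
```

A neural network trained with the same loss would replace the grid search by a learned map from data to the quantile. That step is outside the scope of this sketch.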

Volume 33 SI, pages 1507-1532. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10321540/pdf/nihms-1807576.pdf
Citations: 0
Statistical Inference for High-Dimensional Vector Autoregression with Measurement Error.
IF 1.5 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-05-01 | DOI: 10.5705/ss.202021.0151
Xiang Lyu, Jian Kang, Lexin Li

High-dimensional vector autoregression with measurement error is frequently encountered in a large variety of scientific and business applications. In this article, we study statistical inference of the transition matrix under this model. While there has been a large body of literature studying sparse estimation of the transition matrix, there is a paucity of inference solutions, especially in the high-dimensional scenario. We develop inferential procedures for both the global and simultaneous testing of the transition matrix. We first develop a new sparse expectation-maximization algorithm to estimate the model parameters, and carefully characterize their estimation precisions. We then construct a Gaussian matrix, after proper bias and variance corrections, from which we derive the test statistics. Finally, we develop the testing procedures and establish their asymptotic guarantees. We study the finite-sample performance of our tests through intensive simulations, and illustrate with a brain connectivity analysis example.

Pages 1435-1459. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11623288/pdf/
Citations: 0
Globally Adaptive Longitudinal Quantile Regression with High Dimensional Compositional Covariates.
IF 1.4 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-05-01 | DOI: 10.5705/ss.202021.0006
Huijuan Ma, Qi Zheng, Zhumin Zhang, Huichuan Lai, Limin Peng

In this work, we propose a longitudinal quantile regression framework that enables a robust characterization of heterogeneous covariate-response associations in the presence of high-dimensional compositional covariates and repeated measurements of both response and covariates. We develop a globally adaptive penalization procedure, which can consistently identify covariate sparsity patterns across a continuum set of quantile levels. The proposed estimation procedure properly aggregates longitudinal observations over time, and ensures the satisfaction of the sum-zero coefficient constraint that is needed for proper interpretation of the effects of compositional covariates. We establish the oracle rate of uniform convergence and weak convergence of the resulting estimators, and further justify the proposed uniform selector of the tuning parameter in terms of achieving global model selection consistency. We derive an efficient algorithm by incorporating existing R packages to facilitate stable and fast computation. Our extensive simulation studies confirm the theoretical findings. We apply the proposed method to a longitudinal study of children with cystic fibrosis, where the association between the gut microbiome and other diet-related biomarkers is of interest.

Volume 33 Spec, pages 1295-1318. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10361693/pdf/nihms-1757788.pdf
Citations: 0
Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis With Limited Computational Resources
IF 1.4 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-04-13 | DOI: 10.5705/ss.202021.0257
Shuyuan Wu, Xuening Zhu, Hansheng Wang
Modern statistical analysis often encounters datasets with large sizes. For these datasets, conventional estimation methods can hardly be used immediately because practitioners often suffer from limited computational resources. In most cases, they do not have powerful computational resources (e.g., Hadoop or Spark). How to practically analyze large datasets with limited computational resources then becomes a problem of great importance. To solve this problem, we propose here a novel subsampling-based method with jackknifing. The key idea is to treat the whole sample data as if they were the population. Then, multiple subsamples with greatly reduced sizes are obtained by the method of simple random sampling with replacement. Notably, we do not recommend sampling methods without replacement because this would incur a significant cost for data processing on the hard drive. Such cost does not exist if the data are processed in memory. Because subsampled data have relatively small sizes, they can be comfortably read into computer memory as a whole and then processed easily. Based on subsampled datasets, jackknife-debiased estimators can be obtained for the target parameter. The resulting estimators are statistically consistent, with an extremely small bias. Finally, the jackknife-debiased estimators from different subsamples are averaged together to form the final estimator. We theoretically show that the final estimator is consistent and asymptotically normal. Its asymptotic statistical efficiency can be as good as that of the whole sample estimator under very mild conditions. The proposed method is simple enough to be easily implemented on most practical computer systems and thus should have very wide applicability.
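The three steps described in this abstract (subsample with replacement, jackknife-debias each subsample estimate, average) can be sketched as follows. This is a minimal in-memory illustration under assumptions: it uses the textbook leave-one-out jackknife bias correction, which may differ in detail from the authors' construction, and the function names, subsample size and number of subsamples are hypothetical.

```python
import numpy as np

def jackknife_debias(sample, estimator):
    """Textbook jackknife bias correction:
    n * theta_hat - (n - 1) * mean of leave-one-out estimates."""
    n = len(sample)
    full = estimator(sample)
    loo = np.array([estimator(np.delete(sample, i)) for i in range(n)])
    return n * full - (n - 1) * loo.mean()

def subsample_jackknife(data, estimator, n_sub, k, rng):
    """Average jackknife-debiased estimates over k subsamples of size
    n_sub, each drawn by simple random sampling with replacement."""
    N = len(data)
    estimates = []
    for _ in range(k):
        idx = rng.integers(0, N, size=n_sub)   # indices drawn with replacement
        estimates.append(jackknife_debias(data[idx], estimator))
    return float(np.mean(estimates))

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)                 # stand-in for the "whole sample"
# np.var is the plug-in variance (ddof=0), biased downward in small samples.
est = subsample_jackknife(data, np.var, n_sub=50, k=200, rng=rng)
```

The plug-in variance makes the effect of the correction easy to check: for this estimator the jackknife-debiased value on a subsample coincides with the unbiased sample variance. In the setting the abstract describes, the index draw would be replaced by reading the selected records from disk.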
Citations: 3
Slicing-free Inverse Regression in High-dimensional Sufficient Dimension Reduction
IF 1.4 | Zone 3, Mathematics | Q2 STATISTICS & PROBABILITY | Pub Date: 2023-04-13 | DOI: 10.5705/ss.202022.0112
Qing Mai, X. Shao, Runmin Wang, Xin Zhang
Sliced inverse regression (SIR, Li 1991) is a pioneering work and the most recognized method in sufficient dimension reduction. While promising progress has been made in theory and methods of high-dimensional SIR, two challenges still hamper high-dimensional multivariate applications. First, choosing the number of slices in SIR is a difficult problem, and it depends on the sample size, the distribution of variables, and other practical considerations. Second, the extension of SIR from univariate response to multivariate is not trivial. Targeting the same dimension reduction subspace as SIR, we propose a new slicing-free method that provides a unified solution to sufficient dimension reduction with high-dimensional covariates and univariate or multivariate response. We achieve this by adopting the recently developed martingale difference divergence matrix (MDDM, Lee & Shao 2018) and penalized eigen-decomposition algorithms. To establish the consistency of our method with a high-dimensional predictor and a multivariate response, we develop a new concentration inequality for sample MDDM around its population counterpart using theories for U-statistics, which may be of independent interest. Simulations and real data analysis demonstrate the favorable finite sample performance of the proposed method.
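To make the slicing step concrete, here is a minimal sketch of classical low-dimensional SIR with a univariate response, the baseline this abstract refers to. It is not the slicing-free MDDM method the paper proposes. The function name and defaults are assumptions, and `n_slices` is the tuning parameter whose choice the abstract calls difficult.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Classical sliced inverse regression for a univariate response."""
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Whiten the predictors with the inverse square root of the covariance.
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ inv_sqrt
    # Slice the observations by the order of y and average Z in each slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of the slice-mean covariance, mapped back to X scale.
    _, vecs = np.linalg.eigh(M)                 # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_dirs]
    B = inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
beta = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # hypothetical true direction
y = X @ beta + 0.1 * rng.normal(size=2000)
B = sir_directions(X, y, n_slices=10, n_dirs=1)
```

The sketch inverts the sample covariance, so it applies only when the number of predictors is well below the sample size. The high-dimensional setting of the paper needs the penalized approach it describes.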
Citations: 1
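To make the slicing-free idea in the preceding abstract concrete, here is a minimal low-dimensional sketch. It assumes the sample MDDM of X given Y takes the form -(1/n^2) Σ_{j,k} (X_j - X̄)(X_k - X̄)^T |Y_j - Y_k|, and it replaces the paper's penalized eigen-decomposition with an ordinary generalized eigen-decomposition, so it is not the authors' high-dimensional algorithm. The function names `sample_mddm` and `sdr_directions` and the simulated data are illustrative only.

```python
import numpy as np


def sample_mddm(X, Y):
    """Sample MDDM of X given Y (assumed form, see the note above).

    X: (n, p) predictors. Y: (n,) or (n, q) response.
    Returns a symmetric (p, p) matrix.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float).reshape(len(X), -1)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)  # centre the predictors
    # Pairwise Euclidean distances between responses, shape (n, n).
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return -(Xc.T @ D @ Xc) / n**2


def sdr_directions(X, Y, d):
    """Top-d directions from M v = lambda * Sigma v (unpenalized, needs p >= 2 and n > p)."""
    M = sample_mddm(X, Y)
    Sigma = np.cov(X, rowvar=False)
    # Whiten with Sigma^{-1/2} so the generalized problem becomes a standard one.
    evals, evecs = np.linalg.eigh(Sigma)
    S_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    w, U = np.linalg.eigh(S_inv_half @ M @ S_inv_half)
    top = U[:, np.argsort(w)[::-1][:d]]  # eigenvectors of the d largest eigenvalues
    B = S_inv_half @ top                 # map back to the original scale
    return B / np.linalg.norm(B, axis=0)


# Toy data: the response depends on X only through its first coordinate.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
Y = X[:, 0] + 0.1 * rng.standard_normal(300)
B = sdr_directions(X, Y, d=1)
print(np.round(B.ravel(), 2))
```

No slice count appears anywhere: the pairwise response distances play the role that slice membership plays in SIR, and the same code accepts a multivariate Y without modification.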
Distributed Logistic Regression for Massive Data with Rare Events
IF 1.4 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date : 2023-04-05 DOI: 10.5705/ss.202022.0242
Xia Li, Xuening Zhu, Hansheng Wang
Large-scale rare-events data are commonly encountered in practice. To handle massive rare-events data, we propose a novel distributed estimation method for logistic regression. In a distributed framework, we face two challenges. The first is how to distribute the data; here, two distribution strategies (the RANDOM strategy and the COPY strategy) are investigated. The second is how to select an appropriate type of objective function so that the best asymptotic efficiency can be achieved; here, the under-sampled (US) and inverse probability weighted (IPW) types of objective functions are considered. Our results suggest that the COPY strategy together with the IPW objective function is the best solution for distributed logistic regression with rare events. The finite-sample performance of the distributed methods is demonstrated by simulation studies and a real-world Sweden Traffic Sign dataset.
Citations: 17
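The IPW objective mentioned in the preceding abstract can be illustrated on a single machine. The sketch below keeps every positive case, keeps each negative case with probability `keep_prob`, and weights the retained negatives by `1 / keep_prob` in the logistic log-likelihood. It does not model the RANDOM or COPY distribution strategies, and the function names, the Newton solver, and the simulated data are assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np


def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Newton's method for weighted logistic regression.

    X must already contain an intercept column; w holds per-row weights.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                     # weighted score
        hess = (X * (w * p * (1 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def ipw_undersampled_fit(X, y, keep_prob, rng):
    """Under-sample negatives, then correct with inverse probability weights."""
    keep = (y == 1) | (rng.random(len(y)) < keep_prob)
    w = np.where(y[keep] == 1, 1.0, 1.0 / keep_prob)
    return fit_weighted_logistic(X[keep], y[keep], w)


# Simulated rare-events data (roughly 3% positives).
rng = np.random.default_rng(1)
n = 50_000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-4.0, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat = ipw_undersampled_fit(X, y, keep_prob=0.1, rng=rng)
print(np.round(beta_hat, 2))
```

Fitting the under-sampled data without the weights would bias the intercept, because negatives are then under-represented relative to the population; the weights restore the original class balance in the objective.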