Australian & New Zealand Journal of Statistics: Latest Articles

Robust PCA for high-dimensional data based on characteristic transformation
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2023-06-13 | DOI: 10.1111/anzs.12385
Lingyu He, Yanrong Yang, Bo Zhang

In this paper, we propose a novel robust principal component analysis (PCA) for high-dimensional data in the presence of various heterogeneities, in particular heavy tails and outliers. A transformation motivated by the characteristic function is constructed to improve the robustness of the classical PCA. The suggested method has the distinct advantage of handling heavy-tailed data, whose covariances may not exist (they may be infinite, for instance), in addition to the usual outliers. The proposed approach is also a special case of kernel principal component analysis (KPCA) and gains its robustness and non-linearity from a bounded, non-linear kernel function. The merits of the new method are illustrated by some statistical properties, including an upper bound on the excess error and the behaviour of the large eigenvalues under a spiked covariance model. Additionally, using a variety of simulations, we demonstrate the benefits of our approach over the classical PCA. Finally, using data on protein expression in mice of various genotypes from a biological study, we apply the novel robust PCA to categorise the mice and find that our approach is more effective at identifying abnormal mice than the classical PCA.
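The abstract's point that a bounded kernel limits the influence of extreme observations can be illustrated with a minimal sketch. It assumes a Gaussian kernel as a stand-in, since the abstract does not state the authors' characteristic-function transformation. The function names, the bandwidth of 3.0, and the simulated data are all hypothetical, and the leading component is found by plain power iteration rather than by any method from the paper.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Bounded kernel (assumed stand-in): the value stays in [0, 1] however extreme x or y is."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def centred_gram(data, bandwidth):
    """Gram matrix of the kernel, double-centred as in standard kernel PCA."""
    n = len(data)
    gram = [[gaussian_kernel(data[i], data[j], bandwidth) for j in range(n)]
            for i in range(n)]
    row_mean = [sum(row) / n for row in gram]
    grand_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def leading_eigenpair(matrix, iterations=300, seed=0):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    rng = random.Random(seed)
    n = len(matrix)
    vector = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in vector))
    vector = [v / norm for v in vector]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:
            break
        vector = [p / norm for p in product]
        eigenvalue = norm
    return eigenvalue, vector


# Hypothetical data: 30 standard-normal points in 5 dimensions plus one gross outlier.
rng = random.Random(42)
data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(30)]
data.append([1e6] * 5)

eigenvalue, eigenvector = leading_eigenpair(centred_gram(data, bandwidth=3.0))
print(f"leading eigenvalue of the centred Gram matrix: {eigenvalue:.3f}")
```

Because every kernel value lies between 0 and 1, the leading eigenvalue cannot exceed the number of observations, whereas the sample covariance of the same data would be dominated by the single outlier.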

Citations: 0
Bayesian neural tree models for nonparametric regression
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2023-06-12 | DOI: 10.1111/anzs.12386
Tanujit Chakraborty, Gauri Kamat, Ashis Kumar Chakraborty

Frequentist and Bayesian methods differ in many aspects but share some basic optimal properties. In real-life prediction problems, situations exist in which a model based on one of the above paradigms is preferable depending on some subjective criteria. Nonparametric classification and regression techniques, such as decision trees and neural networks, have both frequentist versions (classification and regression trees (CART) and artificial neural networks) and Bayesian counterparts (Bayesian CART and Bayesian neural networks) for learning from data. In this paper, we present two hybrid models combining the Bayesian and frequentist versions of CART and neural networks, which we call the Bayesian neural tree (BNT) models. BNT models can simultaneously perform feature selection and prediction, are highly flexible, and generalise well in settings with limited training observations. We study the statistical consistency of the proposed approaches and derive the optimal value of a vital model parameter. The excellent performance of the newly proposed BNT models is shown using simulation studies. We also provide some illustrative examples using a wide variety of standard regression datasets from a publicly available machine learning repository to show the superiority of the proposed models over the widely used Bayesian CART and Bayesian neural network models.

Citations: 0
A nonparametric mixture approach to density and null proportion estimation in large-scale multiple comparison problems
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2023-04-04 | DOI: 10.1111/anzs.12383
Xiangjie Xue, Yong Wang

A new method for estimating the proportion of null effects is proposed for solving large-scale multiple comparison problems. It utilises maximum likelihood estimation of nonparametric mixtures, which also provides a density estimate of the test statistics. It overcomes a problem of the usual nonparametric maximum likelihood estimator, which cannot produce a positive probability at the location of null effects when nonparametrically estimating a mixing distribution. The profile likelihood is further used to help produce a range of null proportion values for which the corresponding density estimates are all consistent. With a proper choice of a threshold function on the profile likelihood ratio, the upper endpoint of this range can be shown to be a consistent estimator of the null proportion. Numerical studies show that the proposed method has an apparently convergent trend in all cases studied and performs favourably when compared with existing methods in the literature.

Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/anzs.12383
Citations: 0
A method to reduce the width of confidence intervals by using a normal scores transformation
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2023-03-17 | DOI: 10.1111/anzs.12384
T. W. O’Gorman

In stating the results of their research, scientists usually want to publish narrow confidence intervals because they give precise estimates of the effects of interest. In many cases, the researcher would want to use the narrowest interval that maintains the desired coverage probability. In this manuscript, we propose a new method of finding confidence intervals that are often narrower than traditional confidence intervals for any individual parameter in a linear model if the errors are from a skewed distribution or from a long-tailed symmetric distribution. If the errors are normally distributed, we show that the width of the proposed normal scores confidence interval will not be much greater than the width of the traditional interval. If the dataset includes predictor variables that are uncorrelated or moderately correlated, then the confidence intervals will maintain their coverage probability. However, if the covariates are highly correlated, then the coverage probability of the proposed confidence interval may be slightly lower than the nominal value. The procedure is not computationally intensive, and an R program is available to compute the 95% normal scores confidence interval. Whenever the covariates are not highly correlated, the normal scores confidence interval is recommended for the analysis of datasets having 50 or more observations.
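For readers unfamiliar with normal scores, the underlying transformation can be sketched as follows. This is only the generic rank-based transform, not the confidence interval procedure of the paper. The abstract does not say which score formula is used, so Blom's approximation, (rank - 3/8) / (n + 1/4), is an assumption here, as is the function name.

```python
from statistics import NormalDist


def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank.

    Uses Blom's plotting position (rank - 3/8) / (n + 1/4), an assumed choice.
    Tied values share the average of their ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0  # mean of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


# Hypothetical skewed sample: the extreme value 75.0 is pulled in to a moderate score.
skewed = [0.1, 0.4, 0.5, 2.0, 75.0]
print(normal_scores(skewed))
```

The transform keeps the ordering of the data but discards the magnitude of extreme values, which is why it can help when errors are skewed or long-tailed.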

Citations: 0
Variable selection in heterogeneous panel data models with cross-sectional dependence
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2023-02-15 | DOI: 10.1111/anzs.12381
Xiaoling Mei, Bin Peng, Huanjun Zhu

This paper studies the Bridge estimator for a high-dimensional panel data model with heterogeneous varying coefficients, where the random errors are assumed to be serially correlated and cross-sectionally dependent. We establish oracle efficiency and the asymptotic distribution of the Bridge estimator when the number of covariates increases to infinity with the sample size in both dimensions. A BIC-type criterion is also provided for tuning parameter selection. We further generalise the marginal Bridge estimator for our model so that it asymptotically identifies the covariates with zero coefficients correctly, even when the number of covariates is greater than the sample size, under a partial orthogonality condition. The finite sample performance of the proposed estimator is demonstrated by simulated data examples, and an empirical application with the US stock dataset is also provided.
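As background, the generic Bridge criterion for an ordinary linear model adds a penalty of the form lambda * sum(|beta_j| ** gamma) to the residual sum of squares; gamma = 1 gives the lasso penalty and gamma = 2 the ridge penalty, while values between 0 and 1 are the usual choice for variable selection. The sketch below evaluates that textbook criterion only. It is not the panel-data version with heterogeneous coefficients studied in the paper, and the function and argument names are hypothetical.

```python
def bridge_objective(beta, design, response, penalty_weight, gamma):
    """Residual sum of squares plus the Bridge penalty lambda * sum(|beta_j| ** gamma)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    fitted = [sum(x * b for x, b in zip(row, beta)) for row in design]
    rss = sum((y - f) ** 2 for y, f in zip(response, fitted))
    penalty = penalty_weight * sum(abs(b) ** gamma for b in beta)
    return rss + penalty


# Hypothetical toy data: two observations, two covariates.
design = [[1.0, 0.0], [0.0, 1.0]]
response = [1.0, 2.0]
for gamma in (0.5, 1.0, 2.0):
    value = bridge_objective([1.0, 2.0], design, response, penalty_weight=0.5, gamma=gamma)
    print(gamma, value)
```

Minimising this criterion for gamma below 1 is a non-convex problem, so a real implementation would need an iterative solver; the sketch deliberately stops at evaluating the objective.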

Citations: 0
On two conjectures about perturbations of the stochastic growth rate
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2023-02-15 | DOI: 10.1111/anzs.12382
Stefano Giaimo

The stochastic growth rate describes long-run growth of a population that lives in a fluctuating environment. Perturbation analysis of the stochastic growth rate provides crucial information for population managers, ecologists and evolutionary biologists. This analysis quantifies the response of the stochastic growth rate to changes in demographic parameters. A form of this analysis deals with changes that only occur in some environmental states. Caswell put forth two conjectures about environment-specific perturbations of the stochastic growth rate. The conjectures link the stationary distribution of the stochastic environmental process with the magnitude of some environment-specific perturbations. This note disproves one conjecture and proves the other.
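The quantity under study can be approximated by simulation: draw a long sequence of environments, project the population with the matrix of each environment, and average the one-step log growth. The sketch below is a minimal version under the assumption of independent, identically distributed environments; the abstract speaks only of a stochastic environmental process, which may well be Markovian. The function name and example matrices are hypothetical, and the code does not touch the perturbation analysis or the conjectures themselves.

```python
import math
import random


def stochastic_growth_rate(matrices, probabilities, steps=20000, seed=1):
    """Estimate the long-run average of log(N_{t+1} / N_t) under i.i.d. environments."""
    rng = random.Random(seed)
    size = len(matrices[0])
    population = [1.0 / size] * size  # normalised stage structure
    log_growth = 0.0
    for _ in range(steps):
        projection = rng.choices(matrices, weights=probabilities)[0]
        population = [sum(projection[i][j] * population[j] for j in range(size))
                      for i in range(size)]
        total = sum(population)
        log_growth += math.log(total)
        population = [x / total for x in population]  # renormalise to avoid overflow
    return log_growth / steps


# Hypothetical scalar example: the population doubles or halves with equal probability.
good_year = [[2.0]]
bad_year = [[0.5]]
print(stochastic_growth_rate([good_year, bad_year], [0.5, 0.5]))
```

In this scalar example the arithmetic mean of the yearly multipliers is 1.25, yet the stochastic growth rate is close to zero because long-run growth is governed by the mean of the logarithms.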

Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/anzs.12382
Citations: 0
A Richards growth model to predict fruit weight
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2023-01-05 | DOI: 10.1111/anzs.12380
Daniel Gerhard, Elena Moltchanova

The Richards model comprises several popular sigmoidal and monomolecular growth curves. We illustrate fitting of a Bayesian Richards model by splitting the full growth model into several submodels, followed by a model selection procedure. The performance of the methodology is evaluated by Monte Carlo simulations. A double-sigmoidal version of the Richards model is applied to model grape bunch weight based on data from a New Zealand vineyard over a single growing period.

A Bayesian Richards growth model applied to grape size data. Representations of phenological processes are selected through multi-model inference.
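To make the curve family concrete, here is one common parametrisation of the Richards curve, A * (1 + nu * exp(-k * (t - t_mid))) ** (-1 / nu), which reduces to the logistic curve at nu = 1 and has its inflection point at t_mid. The abstract does not give the authors' parametrisation or priors, so the form, the parameter names, and the treatment of the double-sigmoidal curve as a sum of two Richards curves are all assumptions. The sketch covers only the deterministic mean curve, not the Bayesian fitting or the model selection step.

```python
import math


def richards(t, asymptote, rate, t_mid, shape):
    """Richards curve A * (1 + shape * exp(-rate * (t - t_mid))) ** (-1 / shape), shape > 0."""
    if shape <= 0:
        raise ValueError("this sketch only covers shape > 0")
    return asymptote * (1.0 + shape * math.exp(-rate * (t - t_mid))) ** (-1.0 / shape)


def double_richards(t, first, second):
    """Assumed double-sigmoidal form: the sum of two Richards curves.

    `first` and `second` are (asymptote, rate, t_mid, shape) tuples.
    """
    return richards(t, *first) + richards(t, *second)


# Hypothetical parameters: two growth phases centred on day 30 and day 80.
early_phase = (40.0, 0.15, 30.0, 1.0)
late_phase = (60.0, 0.10, 80.0, 0.5)
for day in (0, 30, 60, 90, 120):
    print(day, round(double_richards(day, early_phase, late_phase), 2))
```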

Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/anzs.12380
Citations: 1
Minimum cost-compression risk in principal component analysis
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2022-12-28 | DOI: 10.1111/anzs.12378
Bhargab Chattopadhyay, Swarnali Banerjee

Principal Component Analysis (PCA) is a popular multivariate analytic tool which can be used for dimension reduction without losing much information. Data vectors containing a large number of features arriving sequentially may be correlated with each other. An effective algorithm for such situations is online PCA. Existing online PCA research revolves around proposing efficient, scalable updating algorithms that focus on compression loss only. It does not take into account the dataset size at which the arrival of further data vectors can be terminated and dimension reduction applied. It is well known that dataset size affects compression loss: the smaller the dataset, the larger the compression loss, and the larger the dataset, the smaller the compression loss. However, reducing compression loss by increasing the dataset size increases the total data collection cost. In this paper, we move beyond the scalability and updating problems related to online PCA and focus on optimising a cost-compression loss, which accounts for both the compression loss and the data collection cost. We minimise the corresponding risk using a two-stage PCA algorithm. The resulting two-stage algorithm is a fast and efficient alternative to online PCA and is shown to exhibit attractive convergence properties with no assumption on specific data distributions. Experimental studies demonstrate similar results and further illustrations are provided using real data. As an extension, a multi-stage PCA algorithm is discussed as well. Given the time complexity, the two-stage PCA algorithm is emphasised over the multi-stage PCA algorithm for online data.

Citations: 0
A new minification integer-valued autoregressive process driven by explanatory variables
IF 1.1 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2022-12-28 | DOI: 10.1111/anzs.12379
Lianyong Qian, Fukang Zhu

The discrete minification model based on the modified negative binomial operator, as an extension of the continuous minification model, can be used to describe an extreme value after a few increasing values. To make this model more practical and flexible, a new minification integer-valued autoregressive process driven by explanatory variables is proposed. Ergodicity of the new process is discussed. The estimators of the unknown parameters are obtained via the conditional least squares and conditional maximum likelihood methods, and their asymptotic properties are also established. A testing procedure for checking the existence of the explanatory variables is developed. Some Monte Carlo simulations are given to illustrate the finite-sample performance of the estimators, under correct specification and misspecification, and of the test. A real data example illustrates the performance of our model.

Citations: 2
Small area estimation under a semi-parametric covariate measured with error
IF 1.1 · Zone 4 (Mathematics) · Q3 Mathematics · Pub Date: 2022-12-08 · DOI: 10.1111/anzs.12377
Reyhane Sefidkar, Mahmoud Torabi, Amir Kavousi

In recent years, small area estimation has played an important role in statistics, as it deals with the problem of obtaining reliable estimates for parameters of interest in areas with small or even zero sample sizes relative to population sizes. Nested error linear regression models are often used in small area estimation, assuming that the covariates are measured without error and that the relationship between the covariates and the response variable is linear. Small area models have also been extended, using penalised spline (P-spline) regression, to the case in which a linear relationship may not hold, but still assuming that the covariates are measured without error. Recently, a nested error regression model that uses P-spline regression for the fixed part of the model has been studied in the Bayesian framework, assuming the presence of measurement error in a covariate. In this paper, we propose a frequentist approach to study a semi-parametric nested error regression model using P-splines with a covariate measured with error. In particular, the pseudo-empirical best predictors of small area means and their corresponding mean squared prediction error estimates are studied. The performance of the proposed approach is evaluated through a simulation study and a real data application.

Citations: 0