
Computational Statistics: Latest Articles

A new approach to nonparametric estimation of multivariate spectral density function using basis expansion
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2024-01-20 DOI: 10.1007/s00180-023-01451-4
Shirin Nezampour, Alireza Nematollahi, Robert T. Krafty, Mehdi Maadooliat

This paper develops a nonparametric method for estimating the spectral density of a multivariate stationary time series using basis expansion. A likelihood-based approach fits the model by minimizing a penalized Whittle negative log-likelihood, and a Newton-type algorithm is developed for the computation. In this method, we smooth the Cholesky factors of the multivariate spectral density matrix so that the estimate reconstructed from the smoothed Cholesky components is consistent and positive-definite. In a simulation study, we illustrate the proposed method and compare it with competing approaches. Finally, we apply the approach to two real-world problems: electroencephalogram (EEG) signal analysis and the El Niño cycle.
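
As a point of reference for the objective being penalized, the sketch below (Python, with illustrative function names) computes the raw multivariate periodogram and the standard Whittle negative log-likelihood for a candidate spectral density. The paper's actual contributions, namely the basis expansion of the Cholesky factors, the roughness penalty, and the Newton-type solver, are not reproduced here.

```python
import numpy as np

def periodogram_matrices(x):
    """Raw multivariate periodogram I(w_k) at the positive Fourier frequencies.

    x : (n, p) array, one column per series (assumed mean-centered).
    Returns an (m, p, p) array of Hermitian matrices, one per frequency.
    """
    n, p = x.shape
    d = np.fft.fft(x, axis=0) / np.sqrt(2 * np.pi * n)   # normalized DFT
    m = n // 2
    return np.stack([np.outer(d[k], d[k].conj()) for k in range(1, m + 1)])

def whittle_nll(spec, perio):
    """Multivariate Whittle negative log-likelihood (up to additive constants).

    spec  : (m, p, p) candidate spectral density matrices f(w_k)
    perio : (m, p, p) periodogram matrices I(w_k)
    """
    nll = 0.0
    for f_k, i_k in zip(spec, perio):
        _, logdet = np.linalg.slogdet(f_k)                 # log det f(w_k)
        nll += logdet + np.real(np.trace(np.linalg.solve(f_k, i_k)))
    return nll
```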

Censored broken adaptive ridge regression in high-dimension
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2024-01-17 DOI: 10.1007/s00180-023-01446-1
Jeongjin Lee, Taehwa Choi, Sangbum Choi

Broken adaptive ridge (BAR) is a penalized regression method that performs variable selection via a computationally scalable surrogate to L_0 regularization. BAR regression has many appealing features: it converges to L_0-penalized selection as a result of iteratively reweighting L_2 penalties, and it satisfies the oracle property with a grouping effect for highly correlated covariates. In this paper, we investigate the BAR procedure for variable selection in a semiparametric accelerated failure time model with complex high-dimensional censored data. Coupled with Buckley-James-type responses, BAR-based variable selection can be performed when event times are censored in complex ways, such as right-censored, left-censored, or doubly censored. Our approach uses a two-stage cyclic coordinate descent algorithm that minimizes the objective function by iteratively estimating the pseudo survival responses and the regression coefficients along coordinate directions. Under weak regularity conditions, we establish both the oracle property and the grouping effect of the proposed BAR estimator. Numerical studies investigate the finite-sample performance of the proposed algorithm, and an application to real data is provided as an example.
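
A minimal sketch of the generic BAR idea for an uncensored linear model, assuming the familiar reweighted-ridge formulation: each step solves a ridge problem whose penalty weights are the inverse squared coefficients from the previous step. The paper instead couples this idea with Buckley-James-type imputed responses for censored data and a two-stage cyclic coordinate descent algorithm, which are not shown; all names below are illustrative.

```python
import numpy as np

def broken_adaptive_ridge(X, y, lam=1.0, n_iter=50, eps=1e-8, tol=1e-6):
    """Plain BAR for an uncensored linear model (illustrative only)."""
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS/min-norm initializer
    for _ in range(n_iter):
        w = 1.0 / (beta ** 2 + eps)                     # adaptive ridge weights
        beta_new = np.linalg.solve(XtX + lam * np.diag(w), Xty)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-4] = 0.0                     # negligible coefficients treated as zero
    return beta
```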

High-dimensional penalized Bernstein support vector classifier
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2024-01-16 DOI: 10.1007/s00180-023-01448-z
Rachid Kharoubi, Abdallah Mkhadri, Karim Oualkacha

The support vector machine (SVM) is a powerful classifier for binary classification. However, the nondifferentiability of the SVM hinge loss can lead to computational difficulties in high-dimensional settings. To overcome this problem, we rely on the Bernstein polynomial and propose a new smoothed version of the SVM hinge loss called the Bernstein support vector machine (BernSVC). This extension is suitable for the high-dimensional regime. As the BernSVC objective loss function is twice differentiable everywhere, we propose two efficient algorithms for computing the solution of the penalized BernSVC: the first is based on coordinate descent with the maximization-majorization principle, and the second is an iteratively reweighted least squares-type algorithm. Under standard assumptions, we derive a cone condition and restricted strong convexity to establish an upper bound for the weighted lasso BernSVC estimator. Using a local linear approximation, we extend the latter result to the penalized BernSVC with the nonconvex penalties SCAD and MCP. Our bound holds with high probability and achieves the so-called fast rate under mild conditions on the design matrix. Simulation studies illustrate the prediction accuracy of BernSVC relative to its competitors and compare the performance of the two algorithms in terms of computational time and estimation error. The use of the proposed method is illustrated through the analysis of three large-scale real-data examples.
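
Why smoothness matters can be seen from the objective itself. The sketch below uses a generic Huberized hinge as a stand-in smoothed loss; it is not the Bernstein-polynomial smoothing of BernSVC, and only illustrates a lasso-penalized, differentiable surrogate of the SVM objective. All names are illustrative.

```python
import numpy as np

def huberized_hinge(u, delta=0.5):
    """A generic smoothed hinge: the kink at u = 1 is replaced by a quadratic
    piece of width delta, so the loss is differentiable everywhere."""
    out = np.zeros_like(u, dtype=float)
    low = u <= 1 - delta
    mid = (u > 1 - delta) & (u < 1)
    out[low] = 1 - u[low] - delta / 2
    out[mid] = (1 - u[mid]) ** 2 / (2 * delta)
    return out

def penalized_objective(beta, X, y, lam, delta=0.5):
    """Lasso-penalized smoothed-hinge objective; y must be coded as +/-1."""
    margins = y * (X @ beta)
    return huberized_hinge(margins, delta).mean() + lam * np.abs(beta).sum()
```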

Random forest based quantile-oriented sensitivity analysis indices estimation
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2024-01-12 DOI: 10.1007/s00180-023-01450-5
Kévin Elie-Dit-Cosaque, Véronique Maume-Deschamps

We propose a random forest based estimation procedure for Quantile-Oriented Sensitivity Analysis (QOSA). To be efficient, a cross-validation step on the leaf size of the trees is required. Our full estimation procedure is tested on both simulated data and a real dataset. Our estimators use either bootstrap samples or the original sample in the estimation, and they are based either on a quantile plug-in procedure (the R-estimators) or on a direct minimization (the Q-estimators). This leads to eight different estimators, which are compared in simulations. These simulations suggest that the estimation method based on direct minimization is better than the one that plugs in the quantile. This is a significant result because the method with direct minimization requires only one sample and could therefore be preferred.
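
A minimal sketch of a QOSA-type index, assuming the common pinball-loss formulation: the index compares the expected check loss at the marginal quantile with the expected check loss at the conditional quantile given an input. A crude binning estimator of the conditional quantiles stands in for the random-forest estimators studied in the paper; all names are illustrative.

```python
import numpy as np

def pinball(y, q, alpha):
    """Pinball (check) loss psi_alpha(y, q)."""
    return (y - q) * (alpha - (y < q))

def qosa_index(x, y, alpha=0.9, n_bins=20):
    """Crude plug-in QOSA index for a single scalar input x."""
    q_marg = np.quantile(y, alpha)
    denom = pinball(y, q_marg, alpha).mean()
    bins = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(bins, x, side="right") - 1, 0, n_bins - 1)
    cond_loss = 0.0
    for b in range(n_bins):
        yb = y[idx == b]
        if yb.size:                                     # empty bins contribute nothing
            cond_loss += pinball(yb, np.quantile(yb, alpha), alpha).sum()
    return 1.0 - (cond_loss / y.size) / denom
```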

Structured dictionary learning of rating migration matrices for credit risk modeling
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2024-01-10 DOI: 10.1007/s00180-023-01449-y

Abstract

The rating migration matrix is a crux for assessing credit risk. Modeling and predicting these matrices is therefore an issue of great importance for risk managers in any financial institution. As a challenger to the usual parametric modeling approaches, we propose a new structured dictionary learning model with auto-regressive regularization that is able to meet key expectations and constraints: a small amount of data, fast evolution of these matrices over time, and economic interpretability of the calibrated model. To show the model's applicability, we present a numerical test on both synthetic and real data and a comparison study with the widely used parametric Gaussian copula model: our new approach based on dictionary learning significantly outperforms the Gaussian copula model.
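
A bare-bones sketch of the alternating least-squares core of dictionary learning (Y ≈ D A), where each column of Y is a vectorized migration matrix observed at one date. The auto-regressive regularization of the codes and the structural constraints on migration matrices that the paper adds are deliberately omitted; all names are illustrative.

```python
import numpy as np

def dictionary_learning(Y, k=3, lam=0.1, n_iter=100, seed=0):
    """Alternating ridge updates for a dictionary D (m x k) and codes A (k x T)."""
    m, T = Y.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((m, k))
    for _ in range(n_iter):
        # codes given the dictionary (ridge keeps the system well posed)
        A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ Y)
        # dictionary given the codes
        D = np.linalg.solve(A @ A.T + lam * np.eye(k), A @ Y.T).T
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)  # unit-norm atoms
    return D, A
```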

A latent variable approach for modeling recall-based time-to-event data with Weibull distribution
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2024-01-03 DOI: 10.1007/s00180-023-01444-3

Abstract

The ability of individuals to recall events is influenced by the time interval between the monitoring time and the occurrence of the event. In this article, we introduce a non-recall probability function that incorporates this information into our modeling framework. We model the time-to-event with the Weibull distribution and adopt a latent variable approach to handle situations where recall is not possible. In the classical framework, we obtain point estimators using the expectation-maximization algorithm and construct the observed Fisher information matrix using the missing information principle. Within the Bayesian paradigm, we derive point estimators under suitable choices of priors and calculate highest posterior density intervals using Markov chain Monte Carlo samples. To assess the performance of the proposed estimators, we conduct an extensive simulation study. Additionally, we use age-at-menarche and breastfeeding datasets to illustrate the effectiveness of the proposed methodology.
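
A deliberately naive sketch of the kind of likelihood involved, assuming recalled events contribute the Weibull density and non-recalled events are treated as left-censored at the monitoring time. The paper's non-recall probability function and its latent-variable EM treatment of the unknown event time are not reproduced; all names are illustrative.

```python
import numpy as np
from scipy.stats import weibull_min

def recall_loglik(params, t_recalled, s_nonrecalled):
    """Naive Weibull log-likelihood with exact and left-censored contributions.

    params        : (log shape, log scale), so both parameters stay positive
    t_recalled    : exact event times reported by subjects who recall them
    s_nonrecalled : monitoring times of subjects who experienced the event
                    but cannot recall when (left-censored at s)
    """
    shape, scale = np.exp(params)
    ll = weibull_min.logpdf(t_recalled, shape, scale=scale).sum()
    ll += weibull_min.logcdf(s_nonrecalled, shape, scale=scale).sum()
    return ll
```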

Testing for linearity in scalar-on-function regression with responses missing at random
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2024-01-03 DOI: 10.1007/s00180-023-01445-2
Manuel Febrero-Bande, Pedro Galeano, Eduardo García-Portugués, Wenceslao González-Manteiga

A goodness-of-fit test is proposed for the Functional Linear Model with Scalar Response (FLMSR) when responses are Missing at Random (MAR). The test statistic relies on a marked empirical process indexed by the projected functional covariate, and its distribution under the null hypothesis is calibrated with a wild bootstrap procedure. The computation and performance of the test rely on having an accurate estimator of the functional slope of the FLMSR when the sample has MAR responses. Three estimation methods based on the Functional Principal Components (FPCs) of the covariate are considered. First, the simplified method estimates the functional slope by simply discarding observations with missing responses. Second, the imputed method estimates the functional slope by imputing the missing responses using the simplified estimator. Third, the inverse probability weighted method incorporates the missing-response generation mechanism when imputing. Furthermore, both cross-validation and LASSO regression are used to select the FPCs used by each estimator. Several Monte Carlo experiments analyze the behavior of the testing procedure in combination with the functional slope estimators. Results indicate that estimators performing missing-response imputation achieve the highest power. The testing procedure is applied to check for linear dependence between the average number of sunny days per year and the mean curve of daily temperatures at weather stations in Spain.
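
A generic sketch of wild-bootstrap calibration of a goodness-of-fit statistic: residuals from the null fit are perturbed with random sign flips, the statistic is recomputed on each perturbed sample, and the p-value is the bootstrap exceedance proportion. Here `stat_fn` and `fit_null` are placeholders, since the paper's statistic is a marked empirical process indexed by the projected functional covariate.

```python
import numpy as np

def wild_bootstrap_pvalue(stat_fn, fit_null, x, y, n_boot=500, seed=0):
    """stat_fn(x, y, fitted) -> scalar statistic; fit_null(x, y) -> fitted values."""
    rng = np.random.default_rng(seed)
    fitted = fit_null(x, y)
    resid = y - fitted
    t_obs = stat_fn(x, y, fitted)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        v = rng.choice([-1.0, 1.0], size=len(y))   # Rademacher multipliers
        y_star = fitted + resid * v                 # responses regenerated under the null
        fitted_star = fit_null(x, y_star)
        t_boot[b] = stat_fn(x, y_star, fitted_star)
    return (1 + np.sum(t_boot >= t_obs)) / (1 + n_boot)
```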

Estimation and prediction with data quality indexes in linear regressions
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2023-12-20 DOI: 10.1007/s00180-023-01441-6

Abstract

Although many statistical applications brush the question of data quality aside, it is a fundamental concern inherent to external data collection. In this paper, data quality relates to the confidence one can have in the covariate values in a regression framework. More precisely, we study how to integrate the data quality information given by an (n × p) matrix, with n the number of individuals and p the number of explanatory variables. In this view, we suggest a latent variable model that drives the generation of the covariate values, and we introduce a new algorithm that takes all of this information into account for prediction. Our approach provides unbiased estimators of the regression coefficients and allows predictions adapted to a given quality pattern. The usefulness of our procedure is illustrated through simulations and real-life applications.

An extended Langevinized ensemble Kalman filter for non-Gaussian dynamic systems
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2023-12-14 DOI: 10.1007/s00180-023-01443-4
Peiyi Zhang, Tianning Dong, Faming Liang

State estimation for large-scale non-Gaussian dynamic systems remains an unresolved issue, given the nonscalability of existing particle filter algorithms. To address this issue, this paper extends the Langevinized ensemble Kalman filter (LEnKF) algorithm to non-Gaussian dynamic systems by introducing a latent Gaussian measurement variable into the dynamic system. The extended LEnKF algorithm converges to the correct filtering distribution as the number of stages becomes large, while inheriting the scalability of the LEnKF algorithm with respect to sample size and state dimension. The performance of the extended LEnKF algorithm is illustrated with dynamic network embedding and dynamic Poisson spatial models.
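
For orientation, the sketch below is one standard stochastic EnKF analysis step, not the Langevinized variant: the LEnKF replaces this update with Langevin-type moves for scalability, and the paper's extension additionally introduces a latent Gaussian measurement variable for non-Gaussian observations. Names and the linear observation operator are illustrative assumptions.

```python
import numpy as np

def enkf_update(ensemble, y_obs, H, R, rng):
    """One stochastic EnKF analysis step with perturbed observations.

    ensemble : (n_particles, d_state) forecast ensemble
    y_obs    : (d_obs,) observation vector
    H        : (d_obs, d_state) linear observation matrix
    R        : (d_obs, d_obs) observation noise covariance
    """
    n, _ = ensemble.shape
    A = ensemble - ensemble.mean(axis=0)                    # anomalies
    P = A.T @ A / (n - 1)                                   # sample state covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)            # Kalman gain
    noise = rng.multivariate_normal(np.zeros(len(y_obs)), R, size=n)
    innov = y_obs + noise - ensemble @ H.T                  # perturbed-observation innovations
    return ensemble + innov @ K.T
```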

An effective method for identifying clusters of robot strengths
IF 1.3 CAS Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY Pub Date: 2023-12-11 DOI: 10.1007/s00180-023-01442-5
Jen-Chieh Teng, Chin-Tsang Chiang, Alvin Lim

In the analysis of qualification-stage data from FIRST Robotics Competition (FRC) championships, the ratio (1.67-1.68) of the number of observations (110-114 matches) to the number of parameters (66-68 robots) in each division has been found to be quite small for the most commonly used winning margin power rating (WMPR) model. This usually leads to imprecise estimates and inaccurate predictions in the three-on-three matches that FRC tournaments are composed of. Recognizing a clustering feature in estimated robot strengths, a more flexible model with latent clusters of robots was proposed to alleviate the overparameterization of the WMPR model. Since its structure can be regarded as a dimension reduction of the parameter space of the WMPR model, the identification of clusters of robot strengths is naturally transformed into a model selection problem. Instead of comparing a huge number of competing models (7.76 × 10^67 to 3.66 × 10^70), we develop an effective method to estimate the number of clusters, the clusters of robots, and the robot strengths from qualification-stage data in the format of the FRC championships. The new method consists of two parts: (i) a combination of hierarchical and non-hierarchical classifications to determine candidate models; and (ii) variant goodness-of-fit criteria to select optimal models. In contrast to existing hierarchical classification, each step of our proposed non-hierarchical classification is based on the robot strengths estimated from a candidate model in the preceding non-hierarchical classification step. A great advantage of the proposed methodology is its ability to consider reassigning robots to other clusters. To reduce overestimation of the number of clusters by the mean squared prediction error criteria, corresponding Bayesian information criteria are further established as alternatives for model selection. With a coherent assembly of these essential elements, a systematic procedure is presented for parameter estimation. In addition, we propose two indices to measure the nested relation between clusters from any two models and the monotonic association between robot strengths from any two models. Data from the 2018 and 2019 FRC championships and a simulation study are also used to illustrate the applicability and superiority of our proposed methodology.
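
A minimal sketch of fitting the baseline WMPR model by least squares: the red-minus-blue winning margin of each three-on-three match is regressed on a design matrix with +1 for the red alliance's robots and -1 for the blue alliance's robots. Input names are hypothetical, and the clustered extension studied in the paper (groups of robots constrained to share a common strength) is not shown.

```python
import numpy as np

def wmpr_strengths(matches, margins, n_robots):
    """Least-squares fit of the winning margin power rating (WMPR) model.

    matches : list of (red_ids, blue_ids), each a tuple of 3 robot indices
    margins : array of red-minus-blue winning margins, one per match
    """
    X = np.zeros((len(matches), n_robots))
    for m, (red, blue) in enumerate(matches):
        X[m, list(red)] = 1.0     # alliance partners add their strengths
        X[m, list(blue)] = -1.0   # opponents subtract theirs
    strengths, *_ = np.linalg.lstsq(X, np.asarray(margins, float), rcond=None)
    return strengths
```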
