Pub Date: 2024-11-07 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2422403
R Lakshmi, T A Sajesh
Identifying outliers in data analysis is a critical task, as outliers can significantly influence the results and conclusions drawn from a dataset. This study explores the use of the Mahalanobis distance metric for detecting outliers in multivariate data, focusing on a novel approach inspired by the work of M. Falk [On mad and comedians, Ann. Inst. Stat. Math. 49 (1997), pp. 615-644]. Through extensive simulation analysis, we empirically evaluate the affine equivariance and breakdown properties of the proposed distance measure and show that it achieves high true positive rates (TPR) and low false positive rates (FPR) relative to existing outlier detection techniques. Applied to seven different datasets, the method again yields promising TPR and FPR values and outperforms several well-known outlier identification approaches, making it an effective choice for fields demanding reliable outlier detection.
{"title":"A robust distance-based approach for detecting multidimensional outliers.","authors":"R Lakshmi, T A Sajesh","doi":"10.1080/02664763.2024.2422403","DOIUrl":"10.1080/02664763.2024.2422403","url":null,"abstract":"<p><p>Identifying outliers in data analysis is a critical task, as outliers can significantly influence the results and conclusions drawn from a dataset. This study explores the use of the Mahalanobis distance metric for detecting outliers in multivariate data, focusing on a novel approach inspired by the work of M. Falk, [<i>On mad and comedians</i>, Ann. Inst. Stat. Math. 49 (1997), pp. 615-644]. The proposed method is rigorously tested through extensive simulation analysis, where it demonstrates high True Positive Rates (TPR) and low False Positive Rates (FPR) when compared to other existing outlier detection techniques. Through extensive simulation analysis, we empirically evaluate the affine equivariance and breakdown properties of our proposed distance measure and it is evident from the outputs that our robust distance measure demonstrates effective results with respect to the measures FPR and TPR. The proposed method was applied to seven different datasets, showing promising true positive rates (TPR) and false positive rates (FPR), and it outperformed several well-known outlier identification approaches. We can effectively use our proposed distance measure in fields demanding outlier detection.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1278-1298"},"PeriodicalIF":1.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035934/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144016593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-07 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2411214
Juan F Díaz-Sepúlveda, Nicoletta D'Angelo, Giada Adelfio, Jonatan A González, Francisco J Rodríguez-Cortés
This study introduces a novel method specifically designed to detect clusters of points within linear networks. This method extends a classification approach used for point processes in spatial contexts. Unlike traditional methods that operate on planar spaces, our approach adapts to the unique geometric challenges of linear networks, where classical properties of point processes are altered, and intuitive data visualisation becomes more complex. Our method utilises the distribution of the Kth nearest neighbour volumes, extending planar-based clustering techniques to identify regions of increased point density within a network. This approach is particularly effective for distinguishing overlapping Poisson processes within the same linear network. We demonstrate the practical utility of our method through applications to road traffic accident data from two Colombian cities, Bogota and Medellin. Our results reveal distinct clusters of high-density points in road segments where severe traffic accidents (resulting in injuries or fatalities) are most likely to occur, highlighting areas of increased risk. These clusters were primarily located on major arterial roads with high traffic volumes. In contrast, low-density points corresponded to areas with fewer accidents, likely due to lower traffic flow or other mitigating factors. Our findings provide valuable insights for urban planning and road safety management.
{"title":"Clustering in point processes on linear networks using nearest neighbour volumes.","authors":"Juan F Díaz-Sepúlveda, Nicoletta D'Angelo, Giada Adelfio, Jonatan A González, Francisco J Rodríguez-Cortés","doi":"10.1080/02664763.2024.2411214","DOIUrl":"10.1080/02664763.2024.2411214","url":null,"abstract":"<p><p>This study introduces a novel method specifically designed to detect clusters of points within linear networks. This method extends a classification approach used for point processes in spatial contexts. Unlike traditional methods that operate on planar spaces, our approach adapts to the unique geometric challenges of linear networks, where classical properties of point processes are altered, and intuitive data visualisation becomes more complex. Our method utilises the distribution of the <i>K</i>th nearest neighbour volumes, extending planar-based clustering techniques to identify regions of increased point density within a network. This approach is particularly effective for distinguishing overlapping Poisson processes within the same linear network. We demonstrate the practical utility of our method through applications to road traffic accident data from two Colombian cities, Bogota and Medellin. Our results reveal distinct clusters of high-density points in road segments where severe traffic accidents (resulting in injuries or fatalities) are most likely to occur, highlighting areas of increased risk. These clusters were primarily located on major arterial roads with high traffic volumes. In contrast, low-density points corresponded to areas with fewer accidents, likely due to lower traffic flow or other mitigating factors. Our findings provide valuable insights for urban planning and road safety management.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 5","pages":"993-1016"},"PeriodicalIF":1.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951330/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143752900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-06 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2420221
Daisuke Yoneoka, Takayuki Kawashima, Yuta Tanoue, Shuhei Nomura, Akifumi Eguchi
Estimating the exposure time to single infectious pathogens and the associated incubation period, based on symptom onset data, is crucial for identifying infection sources and implementing public health interventions. However, data from rapid surveillance systems designed for early outbreak warning often come with outliers originating from individuals who were not directly exposed to the initial source of infection (i.e. tertiary and subsequent infection cases), making the estimation of exposure time challenging. To address this issue, this study uses a three-parameter lognormal distribution and proposes a new γ-divergence-based robust approach for estimating the parameter corresponding to exposure time, with a tailored optimization procedure based on the majorization-minimization algorithm, which ensures the monotonic decreasing property of the objective function. Comprehensive numerical experiments and real data analyses suggest that our method is superior to conventional methods in terms of bias, mean squared error, and coverage probability of 95% confidence intervals.
{"title":"Robust estimation of the incubation period and the time of exposure using <i>γ</i>-divergence.","authors":"Daisuke Yoneoka, Takayuki Kawashima, Yuta Tanoue, Shuhei Nomura, Akifumi Eguchi","doi":"10.1080/02664763.2024.2420221","DOIUrl":"https://doi.org/10.1080/02664763.2024.2420221","url":null,"abstract":"<p><p>Estimating the exposure time to single infectious pathogens and the associated incubation period, based on symptom onset data, is crucial for identifying infection sources and implementing public health interventions. However, data from rapid surveillance systems designed for early outbreak warning often come with outliers originated from individuals who were not directly exposed to the initial source of infection (i.e. tertiary and subsequent infection cases), making the estimation of exposure time challenging. To address this issue, this study uses a three-parameter lognormal distribution and proposes a new <i>γ</i>-divergence-based robust approach for estimating the parameter corresponding to exposure time with a tailored optimization procedure using the majorization-minimization algorithm, which ensures the monotonic decreasing property of the objective function. Comprehensive numerical experiments and real data analyses suggest that our method is superior to conventional methods in terms of bias, mean squared error, and coverage probability of 95% confidence intervals.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1239-1257"},"PeriodicalIF":1.2,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035932/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143992898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-04 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2423234
Shiqi Liu, Zilong Xie, Ming Zheng, Wen Yu
Subsampling designs are useful for reducing computational load and storage cost in large-scale data analysis. For massive survival data with right censoring, we propose a class of optimal subsampling designs under the widely used Cox model. The proposed designs utilize information from both the outcome and the covariates. Different forms of the design can be derived adaptively to meet various targets, such as optimizing the overall estimation accuracy or minimizing the variation of a specific linear combination of the estimators. Given the subsampled data, the inverse probability weighting approach is employed to estimate the model parameters. The resultant estimators are shown to be consistent and asymptotically normally distributed. Simulation results indicate that the proposed subsampling design yields more efficient estimators than uniform subsampling with subsampled data of comparable sizes. Additionally, the subsampling estimation significantly reduces the computational load and storage cost relative to the full data estimation. An analysis of a real data example is provided for illustration.
{"title":"An optimal subsampling design for large-scale Cox model with censored data.","authors":"Shiqi Liu, Zilong Xie, Ming Zheng, Wen Yu","doi":"10.1080/02664763.2024.2423234","DOIUrl":"10.1080/02664763.2024.2423234","url":null,"abstract":"<p><p>Subsampling designs are useful for reducing computational load and storage cost for large-scale data analysis. For massive survival data with right censoring, we propose a class of optimal subsampling designs under the widely-used Cox model. The proposed designs utilize information from both the outcome and the covariates. Different forms of the design can be derived adaptively to meet various targets, such as optimizing the overall estimation accuracy or minimizing the variation of specific linear combination of the estimators. Given the subsampled data, the inverse probability weighting approach is employed to estimate the model parameters. The resultant estimators are shown to be consistent and asymptotically normally distributed. Simulation results indicate that the proposed subsampling design yields more efficient estimators than the uniform subsampling by using subsampled data of comparable sample sizes. Additionally, the subsampling estimation significantly reduces the computational load and storage cost relative to the full data estimation. An analysis of a real data example is provided for illustration.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 7","pages":"1315-1341"},"PeriodicalIF":1.1,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12123965/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144199240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-04 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2422392
Zhengxin Wang, Daniel B Rowe, Xinyi Li, D Andrew Brown
Functional magnetic resonance imaging (fMRI) enables indirect detection of brain activity changes via the blood-oxygen-level-dependent (BOLD) signal. Conventional analysis methods mainly rely on the real-valued magnitude of these signals. In contrast, research suggests that analyzing both the real and imaginary components of the complex-valued fMRI (cv-fMRI) signal provides a more holistic approach that can increase power to detect neuronal activation. We propose a fully Bayesian model for brain activity mapping with cv-fMRI data that accommodates both temporal and spatial dynamics. Additionally, we propose a computationally efficient sampling algorithm that enhances processing speed through image partitioning and parallel computation while remaining competitive with state-of-the-art methods. We support these claims with both simulated numerical studies and an application to real cv-fMRI data obtained from a finger-tapping experiment.
{"title":"Efficient fully Bayesian approach to brain activity mapping with complex-valued fMRI data.","authors":"Zhengxin Wang, Daniel B Rowe, Xinyi Li, D Andrew Brown","doi":"10.1080/02664763.2024.2422392","DOIUrl":"10.1080/02664763.2024.2422392","url":null,"abstract":"<p><p>Functional magnetic resonance imaging (fMRI) enables indirect detection of brain activity changes via the blood-oxygen-level-dependent (BOLD) signal. Conventional analysis methods mainly rely on the real-valued magnitude of these signals. In contrast, research suggests that analyzing both real and imaginary components of the complex-valued fMRI (cv-fMRI) signal provides a more holistic approach that can increase power to detect neuronal activation. We propose a fully Bayesian model for brain activity mapping with cv-fMRI data. Our model accommodates temporal and spatial dynamics. Additionally, we propose a computationally efficient sampling algorithm, which enhances processing speed through image partitioning. Our approach is shown to be computationally efficient via image partitioning and parallel computation while being competitive with state-of-the-art methods. We support these claims with both simulated numerical studies and an application to real cv-fMRI data obtained from a finger-tapping experiment.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1299-1314"},"PeriodicalIF":1.1,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035935/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143998676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2420223
David Kraus
We revisit the classic situation in functional data analysis in which curves are observed at discrete, possibly sparse and irregular, arguments with observation noise. We focus on the reconstruction of individual curves by prediction intervals and bands. The standard approach consists of two steps: first, one estimates the mean and covariance function of curves and observation noise variance function by, e.g. penalized splines, and second, under Gaussian assumptions, one derives the conditional distribution of a curve given observed data and constructs prediction sets with required properties, usually employing sampling from the predictive distribution. This approach is well established, commonly used and theoretically valid but practically, it surprisingly fails in its key property: prediction sets constructed this way often do not have the required coverage. The actual coverage is lower than the nominal one. We investigate the cause of this issue and propose a computationally feasible remedy that leads to prediction regions with a much better coverage. Our method accounts for the uncertainty of the predictive model by sampling from the approximate distribution of its spline estimators whose covariance is estimated by a novel sandwich estimator. Our approach also applies to the important case of covariate-adjusted models.
{"title":"Prediction intervals and bands with improved coverage for functional data under noisy discrete observation.","authors":"David Kraus","doi":"10.1080/02664763.2024.2420223","DOIUrl":"10.1080/02664763.2024.2420223","url":null,"abstract":"<p><p>We revisit the classic situation in functional data analysis in which curves are observed at discrete, possibly sparse and irregular, arguments with observation noise. We focus on the reconstruction of individual curves by prediction intervals and bands. The standard approach consists of two steps: first, one estimates the mean and covariance function of curves and observation noise variance function by, e.g. penalized splines, and second, under Gaussian assumptions, one derives the conditional distribution of a curve given observed data and constructs prediction sets with required properties, usually employing sampling from the predictive distribution. This approach is well established, commonly used and theoretically valid but practically, it surprisingly fails in its key property: prediction sets constructed this way often do not have the required coverage. The actual coverage is lower than the nominal one. We investigate the cause of this issue and propose a computationally feasible remedy that leads to prediction regions with a much better coverage. Our method accounts for the uncertainty of the predictive model by sampling from the approximate distribution of its spline estimators whose covariance is estimated by a novel sandwich estimator. Our approach also applies to the important case of covariate-adjusted models.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1258-1277"},"PeriodicalIF":1.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035946/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144010105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-26 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2419495
Predrag M Popović, Hassan S Bakouch, Miroslav M Ristić
A new non-linear stationary process for time series of counts is introduced. The process is composed of a survival component and an innovation component. The survival component is based on the generalized zero-modified geometric thinning operator, and the innovation process also enters the survival component. Several probability distributions for the innovation process are discussed in order to adapt the model to observed series with an excess number of zeros. The conditional maximum likelihood and the conditional least squares methods are investigated for the estimation of the model parameters. The practical value of the model is illustrated on real-life data sets exhibiting both inflation and deflation of zeros, showing how the model can be adapted through appropriate parameter selection.
{"title":"A non-linear integer-valued autoregressive model with zero-inflated data series.","authors":"Predrag M Popović, Hassan S Bakouch, Miroslav M Ristić","doi":"10.1080/02664763.2024.2419495","DOIUrl":"10.1080/02664763.2024.2419495","url":null,"abstract":"<p><p>A new non-linear stationary process for time series of counts is introduced. The process is composed of the survival and innovation component. The survival component is based on the generalized zero-modified geometric thinning operator, where the innovation process figures in the survival component as well. A few probability distributions for the innovation process have been discussed, in order to adjust the model for observed series with the excess number of zeros. The conditional maximum likelihood and the conditional least squares methods are investigated for the estimation of the model parameters. The practical aspect of the model is presented on some real-life data sets, where we observe data with inflation as well as deflation of zeroes so we can notice how the model can be adjusted with the proper parameter selection.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1195-1218"},"PeriodicalIF":1.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035957/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143995010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-25 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2418473
Peter C Austin, Iris Eekhout, Stef van Buuren
Rubin's Rules are commonly used to pool the results of statistical analyses across imputed samples when using multiple imputation. Rubin's Rules cannot be used when the result of an analysis in an imputed dataset is not a statistic and its associated standard error, but a test statistic (e.g. Student's t-test). While complex methods have been proposed for pooling test statistics across imputed samples, these methods have not been implemented in many popular statistical software packages. The median p-value method has been proposed for pooling test statistics: the statistical significance level of the pooled test statistic is the median of the associated p-values across the imputed samples. We evaluated the performance of this method with nine statistical tests: Student's t-test, the Wilcoxon rank sum test, analysis of variance, the Kruskal-Wallis test, the tests of significance for Pearson's and Spearman's correlation coefficients, the Chi-squared test, and the tests of significance for a regression coefficient from a linear regression and from a logistic regression. For each test, the empirical type I error rate was higher than the nominal rate, and the magnitude of the inflation increased as the prevalence of missing data increased. The median p-value method should not be used to assess statistical significance across imputed datasets.
{"title":"Evaluating the median <i>p</i>-value method for assessing the statistical significance of tests when using multiple imputation.","authors":"Peter C Austin, Iris Eekhout, Stef van Buuren","doi":"10.1080/02664763.2024.2418473","DOIUrl":"https://doi.org/10.1080/02664763.2024.2418473","url":null,"abstract":"<p><p>Rubin's Rules are commonly used to pool the results of statistical analyses across imputed samples when using multiple imputation. Rubin's Rules cannot be used when the result of an analysis in an imputed dataset is not a statistic and its associated standard error, but a test statistic (e.g. Student's t-test). While complex methods have been proposed for pooling test statistics across imputed samples, these methods have not been implemented in many popular statistical software packages. The median <i>p</i>-value method has been proposed for pooling test statistics. The statistical significance level of the pooled test statistic is the median of the associated <i>p</i>-values across the imputed samples. We evaluated the performance of this method with nine statistical tests: Student's t-test, Wilcoxon Rank Sum test, Analysis of Variance, Kruskal-Wallis test, the test of significance for Pearson's and Spearman's correlation coefficient, the Chi-squared test, the test of significance for a regression coefficient from a linear regression and from a logistic regression. For each test, the empirical type I error rate was higher than the advertised rate. The magnitude of inflation increased as the prevalence of missing data increased. The median <i>p</i>-value method should not be used to assess statistical significance across imputed datasets.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1161-1176"},"PeriodicalIF":1.2,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035927/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144012737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-24 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2419505
Fernando Henrique de Paula E Silva Mendes, Douglas Eduardo Turatti, Guilherme Pumi
One of the most important hyper-parameters in duration-dependent Markov-switching (DDMS) models is the duration of the hidden states. Because there is currently no procedure for estimating this duration or testing whether a given duration is appropriate for a given data set, an ad hoc duration choice must be heuristically justified. In this paper, we propose and examine a methodology that mitigates the choice of duration in DDMS models when forecasting is the goal. The novelty of this paper is the use of the asymmetric Aranda-Ordaz parametric link function to model transition probabilities in DDMS models, instead of the commonly applied logit link. The idea behind this approach is that any incorrect duration choice is compensated for by the parameter in the link, increasing model flexibility. Two Monte Carlo simulations, based on classical applications of DDMS models, are employed to evaluate the methodology. In addition, an empirical investigation is carried out to forecast the volatility of the S&P500, which showcases the capabilities of the proposed model.
{"title":"Mitigating the choice of the duration in DDMS models through a parametric link.","authors":"Fernando Henrique de Paula E Silva Mendes, Douglas Eduardo Turatti, Guilherme Pumi","doi":"10.1080/02664763.2024.2419505","DOIUrl":"10.1080/02664763.2024.2419505","url":null,"abstract":"<p><p>One of the most important hyper-parameters in duration-dependent Markov-switching (DDMS) models is the duration of the hidden states. Because there is currently no procedure for estimating this duration or testing whether a given duration is appropriate for a given data set, an ad hoc duration choice must be heuristically justified. In this paper, we propose and examine a methodology that mitigates the choice of duration in DDMS models when forecasting is the goal. The novelty of this paper is the use of the asymmetric Aranda-Ordaz parametric link function to model transition probabilities in DDMS models, instead of the commonly applied logit link. The idea behind this approach is that any incorrect duration choice is compensated for by the parameter in the link, increasing model flexibility. Two Monte Carlo simulations, based on classical applications of DDMS models, are employed to evaluate the methodology. In addition, an empirical investigation is carried out to forecast the volatility of the S&P500, which showcases the capabilities of the proposed model.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1219-1238"},"PeriodicalIF":1.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035960/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144018792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-23 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2418476
Wisdom Aselisewine, Suvra Pal, Helton Saulo
The mixture cure rate model (MCM) is the most widely used model for the analysis of survival data with a cured subgroup. In this context, the most common strategy for modeling the cure probability is to assume a generalized linear model with a known link function, such as the logit link. However, the logit model can only capture simple effects of covariates on the cure probability. In this article, we propose a new MCM in which the cure probability is modeled using a decision tree-based classifier and the survival distribution of the uncured is modeled using an accelerated failure time structure. To estimate the model parameters, we develop an expectation maximization (EM) algorithm. Our simulation study shows that the proposed model performs better at capturing nonlinear classification boundaries than the logit-based MCM and the spline-based MCM. This results in more accurate and precise estimates of the cure probabilities, which in turn improves the predictive accuracy of cure. We further show that capturing the nonlinear classification boundary also improves the estimation of the survival distribution of the uncured subjects. Finally, we apply the proposed model and the EM algorithm to analyze an existing bone marrow transplant data set.
{"title":"A semiparametric accelerated failure time-based mixture cure tree.","authors":"Wisdom Aselisewine, Suvra Pal, Helton Saulo","doi":"10.1080/02664763.2024.2418476","DOIUrl":"10.1080/02664763.2024.2418476","url":null,"abstract":"<p><p>The mixture cure rate model (MCM) is the most widely used model for the analysis of survival data with a cured subgroup. In this context, the most common strategy to model the cure probability is to assume a generalized linear model with a known link function, such as the logit link function. However, the logit model can only capture simple effects of covariates on the cure probability. In this article, we propose a new MCM where the cure probability is modeled using a decision tree-based classifier and the survival distribution of the uncured is modeled using an accelerated failure time structure. To estimate the model parameters, we develop an expectation maximization algorithm. Our simulation study shows that the proposed model performs better in capturing nonlinear classification boundaries when compared to the logit-based MCM and the spline-based MCM. This results in more accurate and precise estimates of the cured probabilities, which in-turn results in improved predictive accuracy of cure. We further show that capturing nonlinear classification boundary also improves the estimation results corresponding to the survival distribution of the uncured subjects. Finally, we apply our proposed model and the EM algorithm to analyze an existing bone marrow transplant data.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1177-1194"},"PeriodicalIF":1.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035937/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144020246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}