Pub Date: 2024-11-07 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2422403
R Lakshmi, T A Sajesh
Identifying outliers in data analysis is a critical task, as outliers can significantly influence the results and conclusions drawn from a dataset. This study explores the use of the Mahalanobis distance metric for detecting outliers in multivariate data, focusing on a novel approach inspired by the work of M. Falk [On mad and comedians, Ann. Inst. Stat. Math. 49 (1997), pp. 615-644]. Through extensive simulation analysis, we empirically evaluate the affine equivariance and breakdown properties of the proposed distance measure and show that it achieves high true positive rates (TPR) and low false positive rates (FPR) relative to existing outlier detection techniques. Applied to seven different datasets, the method again yields promising TPR and FPR values and outperforms several well-known outlier identification approaches, making it an effective choice for fields demanding reliable outlier detection.
{"title":"A robust distance-based approach for detecting multidimensional outliers.","authors":"R Lakshmi, T A Sajesh","doi":"10.1080/02664763.2024.2422403","DOIUrl":"10.1080/02664763.2024.2422403","url":null,"abstract":"<p><p>Identifying outliers in data analysis is a critical task, as outliers can significantly influence the results and conclusions drawn from a dataset. This study explores the use of the Mahalanobis distance metric for detecting outliers in multivariate data, focusing on a novel approach inspired by the work of M. Falk, [<i>On mad and comedians</i>, Ann. Inst. Stat. Math. 49 (1997), pp. 615-644]. The proposed method is rigorously tested through extensive simulation analysis, where it demonstrates high True Positive Rates (TPR) and low False Positive Rates (FPR) when compared to other existing outlier detection techniques. Through extensive simulation analysis, we empirically evaluate the affine equivariance and breakdown properties of our proposed distance measure and it is evident from the outputs that our robust distance measure demonstrates effective results with respect to the measures FPR and TPR. The proposed method was applied to seven different datasets, showing promising true positive rates (TPR) and false positive rates (FPR), and it outperformed several well-known outlier identification approaches. We can effectively use our proposed distance measure in fields demanding outlier detection.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1278-1298"},"PeriodicalIF":1.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035934/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144016593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-07 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2411214
Juan F Díaz-Sepúlveda, Nicoletta D'Angelo, Giada Adelfio, Jonatan A González, Francisco J Rodríguez-Cortés
This study introduces a novel method specifically designed to detect clusters of points within linear networks. This method extends a classification approach used for point processes in spatial contexts. Unlike traditional methods that operate on planar spaces, our approach adapts to the unique geometric challenges of linear networks, where classical properties of point processes are altered, and intuitive data visualisation becomes more complex. Our method utilises the distribution of the Kth nearest neighbour volumes, extending planar-based clustering techniques to identify regions of increased point density within a network. This approach is particularly effective for distinguishing overlapping Poisson processes within the same linear network. We demonstrate the practical utility of our method through applications to road traffic accident data from two Colombian cities, Bogota and Medellin. Our results reveal distinct clusters of high-density points in road segments where severe traffic accidents (resulting in injuries or fatalities) are most likely to occur, highlighting areas of increased risk. These clusters were primarily located on major arterial roads with high traffic volumes. In contrast, low-density points corresponded to areas with fewer accidents, likely due to lower traffic flow or other mitigating factors. Our findings provide valuable insights for urban planning and road safety management.
{"title":"Clustering in point processes on linear networks using nearest neighbour volumes.","authors":"Juan F Díaz-Sepúlveda, Nicoletta D'Angelo, Giada Adelfio, Jonatan A González, Francisco J Rodríguez-Cortés","doi":"10.1080/02664763.2024.2411214","DOIUrl":"10.1080/02664763.2024.2411214","url":null,"abstract":"<p><p>This study introduces a novel method specifically designed to detect clusters of points within linear networks. This method extends a classification approach used for point processes in spatial contexts. Unlike traditional methods that operate on planar spaces, our approach adapts to the unique geometric challenges of linear networks, where classical properties of point processes are altered, and intuitive data visualisation becomes more complex. Our method utilises the distribution of the <i>K</i>th nearest neighbour volumes, extending planar-based clustering techniques to identify regions of increased point density within a network. This approach is particularly effective for distinguishing overlapping Poisson processes within the same linear network. We demonstrate the practical utility of our method through applications to road traffic accident data from two Colombian cities, Bogota and Medellin. Our results reveal distinct clusters of high-density points in road segments where severe traffic accidents (resulting in injuries or fatalities) are most likely to occur, highlighting areas of increased risk. These clusters were primarily located on major arterial roads with high traffic volumes. In contrast, low-density points corresponded to areas with fewer accidents, likely due to lower traffic flow or other mitigating factors. Our findings provide valuable insights for urban planning and road safety management.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 5","pages":"993-1016"},"PeriodicalIF":1.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951330/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143752900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-06 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2420221
Daisuke Yoneoka, Takayuki Kawashima, Yuta Tanoue, Shuhei Nomura, Akifumi Eguchi
Estimating the exposure time to single infectious pathogens and the associated incubation period, based on symptom onset data, is crucial for identifying infection sources and implementing public health interventions. However, data from rapid surveillance systems designed for early outbreak warning often come with outliers originating from individuals who were not directly exposed to the initial source of infection (i.e. tertiary and subsequent infection cases), making the estimation of exposure time challenging. To address this issue, this study uses a three-parameter lognormal distribution and proposes a new γ-divergence-based robust approach for estimating the parameter corresponding to exposure time, with a tailored optimization procedure based on the majorization-minimization algorithm, which ensures the monotonic decreasing property of the objective function. Comprehensive numerical experiments and real data analyses suggest that our method is superior to conventional methods in terms of bias, mean squared error, and coverage probability of 95% confidence intervals.
{"title":"Robust estimation of the incubation period and the time of exposure using <i>γ</i>-divergence.","authors":"Daisuke Yoneoka, Takayuki Kawashima, Yuta Tanoue, Shuhei Nomura, Akifumi Eguchi","doi":"10.1080/02664763.2024.2420221","DOIUrl":"https://doi.org/10.1080/02664763.2024.2420221","url":null,"abstract":"<p><p>Estimating the exposure time to single infectious pathogens and the associated incubation period, based on symptom onset data, is crucial for identifying infection sources and implementing public health interventions. However, data from rapid surveillance systems designed for early outbreak warning often come with outliers originated from individuals who were not directly exposed to the initial source of infection (i.e. tertiary and subsequent infection cases), making the estimation of exposure time challenging. To address this issue, this study uses a three-parameter lognormal distribution and proposes a new <i>γ</i>-divergence-based robust approach for estimating the parameter corresponding to exposure time with a tailored optimization procedure using the majorization-minimization algorithm, which ensures the monotonic decreasing property of the objective function. Comprehensive numerical experiments and real data analyses suggest that our method is superior to conventional methods in terms of bias, mean squared error, and coverage probability of 95% confidence intervals.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1239-1257"},"PeriodicalIF":1.2,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035932/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143992898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-04 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2423234
Shiqi Liu, Zilong Xie, Ming Zheng, Wen Yu
Subsampling designs are useful for reducing computational load and storage cost in large-scale data analysis. For massive survival data with right censoring, we propose a class of optimal subsampling designs under the widely used Cox model. The proposed designs utilize information from both the outcome and the covariates. Different forms of the design can be derived adaptively to meet various targets, such as optimizing the overall estimation accuracy or minimizing the variation of a specific linear combination of the estimators. Given the subsampled data, the inverse probability weighting approach is employed to estimate the model parameters. The resultant estimators are shown to be consistent and asymptotically normally distributed. Simulation results indicate that the proposed subsampling design yields more efficient estimators than uniform subsampling with subsampled data of comparable sizes. Additionally, the subsampling estimation significantly reduces the computational load and storage cost relative to the full data estimation. An analysis of a real data example is provided for illustration.
{"title":"An optimal subsampling design for large-scale Cox model with censored data.","authors":"Shiqi Liu, Zilong Xie, Ming Zheng, Wen Yu","doi":"10.1080/02664763.2024.2423234","DOIUrl":"10.1080/02664763.2024.2423234","url":null,"abstract":"<p><p>Subsampling designs are useful for reducing computational load and storage cost for large-scale data analysis. For massive survival data with right censoring, we propose a class of optimal subsampling designs under the widely-used Cox model. The proposed designs utilize information from both the outcome and the covariates. Different forms of the design can be derived adaptively to meet various targets, such as optimizing the overall estimation accuracy or minimizing the variation of specific linear combination of the estimators. Given the subsampled data, the inverse probability weighting approach is employed to estimate the model parameters. The resultant estimators are shown to be consistent and asymptotically normally distributed. Simulation results indicate that the proposed subsampling design yields more efficient estimators than the uniform subsampling by using subsampled data of comparable sample sizes. Additionally, the subsampling estimation significantly reduces the computational load and storage cost relative to the full data estimation. An analysis of a real data example is provided for illustration.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 7","pages":"1315-1341"},"PeriodicalIF":1.1,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12123965/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144199240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-04 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2422392
Zhengxin Wang, Daniel B Rowe, Xinyi Li, D Andrew Brown
Functional magnetic resonance imaging (fMRI) enables indirect detection of brain activity changes via the blood-oxygen-level-dependent (BOLD) signal. Conventional analysis methods mainly rely on the real-valued magnitude of these signals. In contrast, research suggests that analyzing both the real and imaginary components of the complex-valued fMRI (cv-fMRI) signal provides a more holistic approach that can increase power to detect neuronal activation. We propose a fully Bayesian model for brain activity mapping with cv-fMRI data that accommodates both temporal and spatial dynamics. Additionally, we propose a computationally efficient sampling algorithm that enhances processing speed through image partitioning and parallel computation while remaining competitive with state-of-the-art methods. We support these claims with both simulated numerical studies and an application to real cv-fMRI data obtained from a finger-tapping experiment.
{"title":"Efficient fully Bayesian approach to brain activity mapping with complex-valued fMRI data.","authors":"Zhengxin Wang, Daniel B Rowe, Xinyi Li, D Andrew Brown","doi":"10.1080/02664763.2024.2422392","DOIUrl":"10.1080/02664763.2024.2422392","url":null,"abstract":"<p><p>Functional magnetic resonance imaging (fMRI) enables indirect detection of brain activity changes via the blood-oxygen-level-dependent (BOLD) signal. Conventional analysis methods mainly rely on the real-valued magnitude of these signals. In contrast, research suggests that analyzing both real and imaginary components of the complex-valued fMRI (cv-fMRI) signal provides a more holistic approach that can increase power to detect neuronal activation. We propose a fully Bayesian model for brain activity mapping with cv-fMRI data. Our model accommodates temporal and spatial dynamics. Additionally, we propose a computationally efficient sampling algorithm, which enhances processing speed through image partitioning. Our approach is shown to be computationally efficient via image partitioning and parallel computation while being competitive with state-of-the-art methods. We support these claims with both simulated numerical studies and an application to real cv-fMRI data obtained from a finger-tapping experiment.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1299-1314"},"PeriodicalIF":1.1,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035935/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143998676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2420223
David Kraus
We revisit the classic situation in functional data analysis in which curves are observed at discrete, possibly sparse and irregular, arguments with observation noise. We focus on the reconstruction of individual curves by prediction intervals and bands. The standard approach consists of two steps: first, one estimates the mean and covariance function of curves and observation noise variance function by, e.g. penalized splines, and second, under Gaussian assumptions, one derives the conditional distribution of a curve given observed data and constructs prediction sets with required properties, usually employing sampling from the predictive distribution. This approach is well established, commonly used and theoretically valid but practically, it surprisingly fails in its key property: prediction sets constructed this way often do not have the required coverage. The actual coverage is lower than the nominal one. We investigate the cause of this issue and propose a computationally feasible remedy that leads to prediction regions with a much better coverage. Our method accounts for the uncertainty of the predictive model by sampling from the approximate distribution of its spline estimators whose covariance is estimated by a novel sandwich estimator. Our approach also applies to the important case of covariate-adjusted models.
{"title":"Prediction intervals and bands with improved coverage for functional data under noisy discrete observation.","authors":"David Kraus","doi":"10.1080/02664763.2024.2420223","DOIUrl":"10.1080/02664763.2024.2420223","url":null,"abstract":"<p><p>We revisit the classic situation in functional data analysis in which curves are observed at discrete, possibly sparse and irregular, arguments with observation noise. We focus on the reconstruction of individual curves by prediction intervals and bands. The standard approach consists of two steps: first, one estimates the mean and covariance function of curves and observation noise variance function by, e.g. penalized splines, and second, under Gaussian assumptions, one derives the conditional distribution of a curve given observed data and constructs prediction sets with required properties, usually employing sampling from the predictive distribution. This approach is well established, commonly used and theoretically valid but practically, it surprisingly fails in its key property: prediction sets constructed this way often do not have the required coverage. The actual coverage is lower than the nominal one. We investigate the cause of this issue and propose a computationally feasible remedy that leads to prediction regions with a much better coverage. Our method accounts for the uncertainty of the predictive model by sampling from the approximate distribution of its spline estimators whose covariance is estimated by a novel sandwich estimator. Our approach also applies to the important case of covariate-adjusted models.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1258-1277"},"PeriodicalIF":1.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035946/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144010105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-26 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2419495
Predrag M Popović, Hassan S Bakouch, Miroslav M Ristić
A new non-linear stationary process for time series of counts is introduced. The process is composed of a survival component and an innovation component. The survival component is based on the generalized zero-modified geometric thinning operator, and the innovation process also enters the survival component. Several probability distributions for the innovation process are discussed in order to adapt the model to observed series with an excess number of zeros. The conditional maximum likelihood and the conditional least squares methods are investigated for the estimation of the model parameters. The practical value of the model is illustrated on real-life data sets exhibiting both inflation and deflation of zeros, showing how the model can be adapted through appropriate parameter selection.
{"title":"A non-linear integer-valued autoregressive model with zero-inflated data series.","authors":"Predrag M Popović, Hassan S Bakouch, Miroslav M Ristić","doi":"10.1080/02664763.2024.2419495","DOIUrl":"10.1080/02664763.2024.2419495","url":null,"abstract":"<p><p>A new non-linear stationary process for time series of counts is introduced. The process is composed of the survival and innovation component. The survival component is based on the generalized zero-modified geometric thinning operator, where the innovation process figures in the survival component as well. A few probability distributions for the innovation process have been discussed, in order to adjust the model for observed series with the excess number of zeros. The conditional maximum likelihood and the conditional least squares methods are investigated for the estimation of the model parameters. The practical aspect of the model is presented on some real-life data sets, where we observe data with inflation as well as deflation of zeroes so we can notice how the model can be adjusted with the proper parameter selection.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1195-1218"},"PeriodicalIF":1.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035957/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143995010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-25 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2418473
Peter C Austin, Iris Eekhout, Stef van Buuren
Rubin's Rules are commonly used to pool the results of statistical analyses across imputed samples when using multiple imputation. Rubin's Rules cannot be used when the result of an analysis in an imputed dataset is not a statistic and its associated standard error, but a test statistic (e.g. Student's t-test). While complex methods have been proposed for pooling test statistics across imputed samples, these methods have not been implemented in many popular statistical software packages. The median p-value method has been proposed for pooling test statistics: the statistical significance level of the pooled test statistic is the median of the associated p-values across the imputed samples. We evaluated the performance of this method with nine statistical tests: Student's t-test, the Wilcoxon rank sum test, analysis of variance, the Kruskal-Wallis test, the tests of significance for Pearson's and Spearman's correlation coefficients, the Chi-squared test, and the tests of significance for a regression coefficient from a linear regression and from a logistic regression. For each test, the empirical type I error rate was higher than the nominal rate, and the magnitude of the inflation increased as the prevalence of missing data increased. The median p-value method should not be used to assess statistical significance across imputed datasets.
{"title":"Evaluating the median <i>p</i>-value method for assessing the statistical significance of tests when using multiple imputation.","authors":"Peter C Austin, Iris Eekhout, Stef van Buuren","doi":"10.1080/02664763.2024.2418473","DOIUrl":"https://doi.org/10.1080/02664763.2024.2418473","url":null,"abstract":"<p><p>Rubin's Rules are commonly used to pool the results of statistical analyses across imputed samples when using multiple imputation. Rubin's Rules cannot be used when the result of an analysis in an imputed dataset is not a statistic and its associated standard error, but a test statistic (e.g. Student's t-test). While complex methods have been proposed for pooling test statistics across imputed samples, these methods have not been implemented in many popular statistical software packages. The median <i>p</i>-value method has been proposed for pooling test statistics. The statistical significance level of the pooled test statistic is the median of the associated <i>p</i>-values across the imputed samples. We evaluated the performance of this method with nine statistical tests: Student's t-test, Wilcoxon Rank Sum test, Analysis of Variance, Kruskal-Wallis test, the test of significance for Pearson's and Spearman's correlation coefficient, the Chi-squared test, the test of significance for a regression coefficient from a linear regression and from a logistic regression. For each test, the empirical type I error rate was higher than the advertised rate. The magnitude of inflation increased as the prevalence of missing data increased. The median <i>p</i>-value method should not be used to assess statistical significance across imputed datasets.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1161-1176"},"PeriodicalIF":1.2,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035927/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144012737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-24 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2419505
Fernando Henrique de Paula E Silva Mendes, Douglas Eduardo Turatti, Guilherme Pumi
One of the most important hyper-parameters in duration-dependent Markov-switching (DDMS) models is the duration of the hidden states. Because there is currently no procedure for estimating this duration or testing whether a given duration is appropriate for a given data set, an ad hoc duration choice must be heuristically justified. In this paper, we propose and examine a methodology that mitigates the choice of duration in DDMS models when forecasting is the goal. The novelty of this paper is the use of the asymmetric Aranda-Ordaz parametric link function to model transition probabilities in DDMS models, instead of the commonly applied logit link. The idea behind this approach is that any incorrect duration choice is compensated for by the parameter in the link, increasing model flexibility. Two Monte Carlo simulations, based on classical applications of DDMS models, are employed to evaluate the methodology. In addition, an empirical investigation is carried out to forecast the volatility of the S&P500, which showcases the capabilities of the proposed model.
{"title":"Mitigating the choice of the duration in DDMS models through a parametric link.","authors":"Fernando Henrique de Paula E Silva Mendes, Douglas Eduardo Turatti, Guilherme Pumi","doi":"10.1080/02664763.2024.2419505","DOIUrl":"10.1080/02664763.2024.2419505","url":null,"abstract":"<p><p>One of the most important hyper-parameters in duration-dependent Markov-switching (DDMS) models is the duration of the hidden states. Because there is currently no procedure for estimating this duration or testing whether a given duration is appropriate for a given data set, an ad hoc duration choice must be heuristically justified. In this paper, we propose and examine a methodology that mitigates the choice of duration in DDMS models when forecasting is the goal. The novelty of this paper is the use of the asymmetric Aranda-Ordaz parametric link function to model transition probabilities in DDMS models, instead of the commonly applied logit link. The idea behind this approach is that any incorrect duration choice is compensated for by the parameter in the link, increasing model flexibility. Two Monte Carlo simulations, based on classical applications of DDMS models, are employed to evaluate the methodology. In addition, an empirical investigation is carried out to forecast the volatility of the S&P500, which showcases the capabilities of the proposed model.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1219-1238"},"PeriodicalIF":1.1,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035960/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144018792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-23 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2024.2418476
Wisdom Aselisewine, Suvra Pal, Helton Saulo
The mixture cure rate model (MCM) is the most widely used model for the analysis of survival data with a cured subgroup. In this context, the most common strategy for modeling the cure probability is to assume a generalized linear model with a known link function, such as the logit link. However, the logit model can only capture simple effects of covariates on the cure probability. In this article, we propose a new MCM in which the cure probability is modeled using a decision tree-based classifier and the survival distribution of the uncured is modeled using an accelerated failure time structure. To estimate the model parameters, we develop an expectation maximization (EM) algorithm. Our simulation study shows that the proposed model performs better at capturing nonlinear classification boundaries than the logit-based MCM and the spline-based MCM. This results in more accurate and precise estimates of the cure probabilities, which in turn improves the predictive accuracy of cure. We further show that capturing the nonlinear classification boundary also improves the estimation of the survival distribution of the uncured subjects. Finally, we apply the proposed model and the EM algorithm to analyze an existing bone marrow transplant data set.
{"title":"A semiparametric accelerated failure time-based mixture cure tree.","authors":"Wisdom Aselisewine, Suvra Pal, Helton Saulo","doi":"10.1080/02664763.2024.2418476","DOIUrl":"10.1080/02664763.2024.2418476","url":null,"abstract":"<p><p>The mixture cure rate model (MCM) is the most widely used model for the analysis of survival data with a cured subgroup. In this context, the most common strategy to model the cure probability is to assume a generalized linear model with a known link function, such as the logit link function. However, the logit model can only capture simple effects of covariates on the cure probability. In this article, we propose a new MCM where the cure probability is modeled using a decision tree-based classifier and the survival distribution of the uncured is modeled using an accelerated failure time structure. To estimate the model parameters, we develop an expectation maximization algorithm. Our simulation study shows that the proposed model performs better in capturing nonlinear classification boundaries when compared to the logit-based MCM and the spline-based MCM. This results in more accurate and precise estimates of the cured probabilities, which in-turn results in improved predictive accuracy of cure. We further show that capturing nonlinear classification boundary also improves the estimation results corresponding to the survival distribution of the uncured subjects. Finally, we apply our proposed model and the EM algorithm to analyze an existing bone marrow transplant data.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1177-1194"},"PeriodicalIF":1.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035937/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144020246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}