{"title":"Acknowledgement of Referees' Services Remerciements aux membres des jurys","authors":"","doi":"10.1002/cjs.11840","DOIUrl":"https://doi.org/10.1002/cjs.11840","url":null,"abstract":"","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143497132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many common clustering methods cannot be used for clustering balanced multivariate longitudinal data in cases where the covariance of variables is a function of the time points. In this article, a copula kernel mixture model (CKMM) is proposed for clustering data of this type. The CKMM is a finite mixture model that decomposes each mixture component's joint density function into a copula and marginal distribution functions. In this decomposition, the Gaussian copula is used due to its mathematical tractability and Gaussian kernel functions are used to estimate the marginal distributions. A generalized expectation-maximization algorithm is used to estimate the model parameters. The performance of the proposed model is assessed in a simulation study and on two real datasets. The proposed model is shown to have effective performance in comparison with standard methods, such as K-means with dynamic time warping clustering, latent growth models and functional high-dimensional data clustering.
{"title":"Balanced longitudinal data clustering with a copula kernel mixture model","authors":"Xi Zhang, Orla A. Murphy, Paul D. McNicholas","doi":"10.1002/cjs.11838","DOIUrl":"https://doi.org/10.1002/cjs.11838","url":null,"abstract":"<p>Many common clustering methods cannot be used for clustering balanced multivariate longitudinal data in cases where the covariance of variables is a function of the time points. In this article, a copula kernel mixture model (CKMM) is proposed for clustering data of this type. The CKMM is a finite mixture model that decomposes each mixture component's joint density function into a copula and marginal distribution functions. In this decomposition, the Gaussian copula is used due to its mathematical tractability and Gaussian kernel functions are used to estimate the marginal distributions. A generalized expectation-maximization algorithm is used to estimate the model parameters. The performance of the proposed model is assessed in a simulation study and on two real datasets. The proposed model is shown to have effective performance in comparison with standard methods, such as <span></span><math>\u0000 <mrow>\u0000 <mi>K</mi>\u0000 </mrow></math>-means with dynamic time warping clustering, latent growth models and functional high-dimensional data clustering.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.11838","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143497168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article, we use e-values in the context of multiple hypothesis testing, assuming that the base tests produce independent, or sequential, e-values. Our simulation and empirical studies, as well as theoretical considerations, suggest that, under this assumption, our new algorithms are superior to the known algorithms using independent p-values and to our recent algorithms designed for e-values without the assumption of independence.
{"title":"True and false discoveries with independent and sequential e-values","authors":"Vladimir Vovk, Ruodu Wang","doi":"10.1002/cjs.11833","DOIUrl":"https://doi.org/10.1002/cjs.11833","url":null,"abstract":"<p>In this article, we use <i>e</i>-values in the context of multiple hypothesis testing, assuming that the base tests produce independent, or sequential, <i>e</i>-values. Our simulation and empirical studies, as well as theoretical considerations, suggest that, under this assumption, our new algorithms are superior to the known algorithms using independent <i>p</i>-values and to our recent algorithms designed for <i>e</i>-values without the assumption of independence.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"52 4","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.11833","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142642392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article proposes multivariate copula models for hierarchical data. They account for two types of correlation: one is between variables measured on the same unit, and the other is a correlation between units in the same cluster. This model is used to carry out copula regression for hierarchical data that gives cluster-specific prediction curves. In the simple case where a cluster contains two units and where two variables are measured on each one, the new model is constructed with a D-vine. The proposed copula density is expressed in terms of three copula families. When the copula families and the marginal distributions are normal, the model is equivalent to a normal linear mixed model with random cluster-specific intercepts. Methods to select the three copula families and to estimate their parameters are proposed. We perform Monte Carlo studies of the sampling properties of these estimators and of out-of-sample predictions. The new model is applied to a dataset on the marks of students in several schools.
{"title":"A new copula regression model for hierarchical data","authors":"Talagbe Gabin Akpo, Louis-Paul Rivest","doi":"10.1002/cjs.11830","DOIUrl":"10.1002/cjs.11830","url":null,"abstract":"<p>This article proposes multivariate copula models for hierarchical data. They account for two types of correlation: one is between variables measured on the same unit, and the other is a correlation between units in the same cluster. This model is used to carry out copula regression for hierarchical data that gives cluster-specific prediction curves. In the simple case where a cluster contains two units and where two variables are measured on each one, the new model is constructed with a <span></span><math>\u0000 <mrow>\u0000 <mi>D</mi>\u0000 </mrow></math>-vine. The proposed copula density is expressed in terms of three copula families. When the copula families and the marginal distributions are normal, the model is equivalent to a normal linear mixed model with random cluster-specific intercepts. Methods to select the three copula families and to estimate their parameters are proposed. We perform Monte Carlo studies of the sampling properties of these estimators and of out-of-sample predictions. The new model is applied to a dataset on the marks of students in several schools.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.11830","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Epidemic trajectories can be substantially impacted by people modifying their behaviours in response to changes in their perceived risk of spreading or contracting the disease. However, most infectious disease models assume a stable population behaviour. We present a flexible new class of models, called behavioural change individual-level models (BC-ILMs), that incorporate both individual-level covariate information and a data-driven behavioural change effect. Focusing on spatial BC-ILMs, we consider four “alarm” functions to model the effect of behavioural change as a function of infection prevalence over time. Through simulation studies, we find that if behavioural change is present, using an alarm function, even if specified incorrectly, will result in an improvement in posterior predictive performance over a model that assumes stable population behaviour. The methods are applied to data from the 2001 U.K. foot and mouth disease epidemic. The results show some evidence of a behavioural change effect, although it may not meaningfully impact model fit compared to a simpler spatial ILM in this dataset.
{"title":"A framework for incorporating behavioural change into individual-level spatial epidemic models","authors":"Madeline A. Ward, Rob Deardon, Lorna E. Deeth","doi":"10.1002/cjs.11828","DOIUrl":"10.1002/cjs.11828","url":null,"abstract":"<p>Epidemic trajectories can be substantially impacted by people modifying their behaviours in response to changes in their perceived risk of spreading or contracting the disease. However, most infectious disease models assume a stable population behaviour. We present a flexible new class of models, called behavioural change individual-level models (BC-ILMs), that incorporate both individual-level covariate information and a data-driven behavioural change effect. Focusing on spatial BC-ILMs, we consider four “alarm” functions to model the effect of behavioural change as a function of infection prevalence over time. Through simulation studies, we find that if behavioural change is present, using an alarm function, even if specified incorrectly, will result in an improvement in posterior predictive performance over a model that assumes stable population behaviour. The methods are applied to data from the 2001 U.K. foot and mouth disease epidemic. The results show some evidence of a behavioural change effect, although it may not meaningfully impact model fit compared to a simpler spatial ILM in this dataset.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.11828","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider random sample splitting for estimation and inference in high-dimensional generalized linear models (GLMs), where we first apply the lasso to select a submodel using one subsample and then apply the debiased lasso to fit the selected model using the remaining subsample. We show that a sample splitting procedure based on the debiased lasso yields asymptotically normal estimates under mild conditions and that multiple splitting can address the loss of efficiency. Our simulation results indicate that using the debiased lasso instead of the standard maximum likelihood method in the estimation stage can vastly reduce the bias and variance of the resulting estimates. Furthermore, our multiple splitting debiased lasso method has better numerical performance than some existing methods for high-dimensional GLMs proposed in the recent literature. We illustrate the proposed multiple splitting method with an analysis of the smoking data of the Mid-South Tobacco Case–Control Study.
{"title":"Debiased lasso after sample splitting for estimation and inference in high-dimensional generalized linear models","authors":"Omar Vazquez, Bin Nan","doi":"10.1002/cjs.11827","DOIUrl":"10.1002/cjs.11827","url":null,"abstract":"<p>We consider random sample splitting for estimation and inference in high-dimensional generalized linear models (GLMs), where we first apply the lasso to select a submodel using one subsample and then apply the debiased lasso to fit the selected model using the remaining subsample. We show that a sample splitting procedure based on the debiased lasso yields asymptotically normal estimates under mild conditions and that multiple splitting can address the loss of efficiency. Our simulation results indicate that using the debiased lasso instead of the standard maximum likelihood method in the estimation stage can vastly reduce the bias and variance of the resulting estimates. Furthermore, our multiple splitting debiased lasso method has better numerical performance than some existing methods for high-dimensional GLMs proposed in the recent literature. We illustrate the proposed multiple splitting method with an analysis of the smoking data of the Mid-South Tobacco Case–Control Study.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.11827","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many biomedical applications, there is a need to build risk-adjustment models based on clustered data. However, methods for variable selection that are applicable to clustered discrete data settings with a large number of candidate variables and potentially large cluster sizes are lacking. We develop a new variable selection approach that combines within-cluster resampling techniques with penalized likelihood methods to select variables for high-dimensional clustered data. We derive an upper bound on the expected number of falsely selected variables, demonstrate the oracle properties of the proposed method and evaluate the finite sample performance of the method through extensive simulations. We illustrate the proposed approach using a colon surgical site infection data set consisting of 39,468 individuals from 149 hospitals to build risk-adjustment models that account for both the main effects of various risk factors and their two-way interactions.
{"title":"Variable selection in modelling clustered data via within-cluster resampling","authors":"Shangyuan Ye, Tingting Yu, Daniel A. Caroff, Susan S. Huang, Bo Zhang, Rui Wang","doi":"10.1002/cjs.11824","DOIUrl":"10.1002/cjs.11824","url":null,"abstract":"<p>In many biomedical applications, there is a need to build risk-adjustment models based on clustered data. However, methods for variable selection that are applicable to clustered discrete data settings with a large number of candidate variables and potentially large cluster sizes are lacking. We develop a new variable selection approach that combines within-cluster resampling techniques with penalized likelihood methods to select variables for high-dimensional clustered data. We derive an upper bound on the expected number of falsely selected variables, demonstrate the oracle properties of the proposed method and evaluate the finite sample performance of the method through extensive simulations. We illustrate the proposed approach using a colon surgical site infection data set consisting of 39,468 individuals from 149 hospitals to build risk-adjustment models that account for both the main effects of various risk factors and their two-way interactions.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.11824","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article, we develop an innovative, robust method for jointly analyzing longitudinal count and binary responses. The method is useful for bounding the influence of potential outliers in the data when estimating the model parameters. We use a log-linear model for the count response and a logistic regression model for the binary response, where the two response processes are linked through a set of association parameters. The asymptotic properties of the robust estimators are briefly studied. The empirical properties of the estimators are studied based on simulations. The study shows that the proposed estimators are approximately unbiased and also efficient when fitting a joint model to data contaminated with outliers. We also apply the proposed method to some real longitudinal survey data obtained from a health study.
{"title":"Joint analysis of longitudinal count and binary response data in the presence of outliers","authors":"Sanjoy Sinha","doi":"10.1002/cjs.11819","DOIUrl":"10.1002/cjs.11819","url":null,"abstract":"<p>In this article, we develop an innovative, robust method for jointly analyzing longitudinal count and binary responses. The method is useful for bounding the influence of potential outliers in the data when estimating the model parameters. We use a log-linear model for the count response and a logistic regression model for the binary response, where the two response processes are linked through a set of association parameters. The asymptotic properties of the robust estimators are briefly studied. The empirical properties of the estimators are studied based on simulations. The study shows that the proposed estimators are approximately unbiased and also efficient when fitting a joint model to data contaminated with outliers. We also apply the proposed method to some real longitudinal survey data obtained from a health study.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.11819","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article focuses on detecting change points in high-dimensional linear regression models with piecewise constant regression coefficients, moving beyond the conventional reliance on strict Gaussian or sub-Gaussian noise assumptions. In the face of real-world complexities, where noise often deviates into uncertain or heavy-tailed distributions, we propose two tailored algorithms: a dynamic programming algorithm (DPA) for improved localization accuracy, and a binary segmentation algorithm (BSA) optimized for computational efficiency. These solutions are designed to be flexible, catering to increasing sample sizes and data dimensions, and offer a robust estimation of change points without requiring specific moments of the noise distribution. The efficacy of DPA and BSA is thoroughly evaluated through extensive simulation studies and application to real datasets, showing their competitive edge in adaptability and performance.
{"title":"Robust change point detection for high-dimensional linear models with tolerance for outliers and heavy tails","authors":"Zhi Yang, Liwen Zhang, Siyu Sun, Bin Liu","doi":"10.1002/cjs.11826","DOIUrl":"10.1002/cjs.11826","url":null,"abstract":"<p>This article focuses on detecting change points in high-dimensional linear regression models with piecewise constant regression coefficients, moving beyond the conventional reliance on strict Gaussian or sub-Gaussian noise assumptions. In the face of real-world complexities, where noise often deviates into uncertain or heavy-tailed distributions, we propose two tailored algorithms: a dynamic programming algorithm (DPA) for improved localization accuracy, and a binary segmentation algorithm (BSA) optimized for computational efficiency. These solutions are designed to be flexible, catering to increasing sample sizes and data dimensions, and offer a robust estimation of change points without requiring specific moments of the noise distribution. The efficacy of DPA and BSA is thoroughly evaluated through extensive simulation studies and application to real datasets, showing their competitive edge in adaptability and performance.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Missing data reduce the representativeness of the sample and can lead to inference problems. In this article, we apply the Bayesian jackknife empirical likelihood (BJEL) method for inference on data that are missing at random, as well as for causal inference. The semiparametric fractional imputation estimator, propensity score-weighted estimator, and doubly robust estimator are used for constructing the jackknife pseudo values, which are needed for conducting BJEL-based inference with missing data. Existing methods, such as normal approximation and JEL, are compared with the BJEL approach in a simulation study. The proposed approach shows better performance in many scenarios in terms of credible intervals. Furthermore, we demonstrate the application of the proposed approach for causal inference problems in a study of risk factors for impaired kidney function.
{"title":"Bayesian jackknife empirical likelihood-based inference for missing data and causal inference","authors":"Sixia Chen, Yuke Wang, Yichuan Zhao","doi":"10.1002/cjs.11825","DOIUrl":"10.1002/cjs.11825","url":null,"abstract":"<p>Missing data reduce the representativeness of the sample and can lead to inference problems. In this article, we apply the Bayesian jackknife empirical likelihood (BJEL) method for inference on data that are missing at random, as well as for causal inference. The semiparametric fractional imputation estimator, propensity score-weighted estimator, and doubly robust estimator are used for constructing the jackknife pseudo values, which are needed for conducting BJEL-based inference with missing data. Existing methods, such as normal approximation and JEL, are compared with the BJEL approach in a simulation study. The proposed approach shows better performance in many scenarios in terms of credible intervals. Furthermore, we demonstrate the application of the proposed approach for causal inference problems in a study of risk factors for impaired kidney function.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}