Johanna de Haan-Ward, Douglas G. Woolford, Simon J. Bonner
Response-based sampling is often used in modelling rare events from large, imbalanced data for efficiency. When modelling the event with logistic regression, the sampling design may be adjusted for using sampling weights or an offset. We propose a stratified sampling design for modelling rare events with large data which improves on previous methods by providing unbiased estimates of the standard errors of the coefficients in a multiple logistic regression scenario. We use multiple intercepts to model the incidence in the sampled data, then adjust each intercept via a stratum-specific offset. Our simulations provide no evidence of bias in the estimated logistic regression coefficients or their standard errors. We apply this method to spatio-temporal, fine-scale human-caused fire occurrence modelling for a region in northwestern Ontario, Canada, illustrating how the stratified sampling approach results in more locally precise estimates of fire occurrence.
Predicting rare events using training data from stratified sampling designs, with application to human-caused wildfire prediction. Johanna de Haan-Ward, Douglas G. Woolford, Simon J. Bonner. Canadian Journal of Statistics-Revue Canadienne De Statistique, volume 53, issue 3. Published 2025-04-22. DOI: 10.1002/cjs.70008 (https://doi.org/10.1002/cjs.70008). Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.70008
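The offset adjustment mentioned in the rare-events abstract above can be illustrated with a minimal sketch. It uses the standard result for response-based sampling: if events are retained with one probability and non-events with another, the sampled-data log-odds are shifted by the log of the ratio of those probabilities. The stratum names and retention rates below are hypothetical, and this is not the authors' full multiple-intercept method, only the per-stratum offset idea.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def expit(z: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))


def sampling_offset(p_keep_event: float, p_keep_nonevent: float) -> float:
    """Shift in the logistic intercept induced by response-based sampling."""
    return math.log(p_keep_event / p_keep_nonevent)


def population_probability(p_sample: float,
                           p_keep_event: float,
                           p_keep_nonevent: float) -> float:
    """Map a probability fitted on the sampled data back to the population scale."""
    return expit(logit(p_sample) - sampling_offset(p_keep_event, p_keep_nonevent))


# Hypothetical strata: (retention rate for events, retention rate for non-events).
strata_rates = {"stratum_a": (1.0, 0.10), "stratum_b": (1.0, 0.02)}

# One offset per stratum, to be subtracted from that stratum's fitted intercept.
offsets = {name: sampling_offset(*rates) for name, rates in strata_rates.items()}

# Round trip: a true event probability of 0.01 looks far more common in the sample.
true_p = 0.01
sample_odds = (true_p / (1.0 - true_p)) * (1.0 / 0.10)
p_sample = sample_odds / (1.0 + sample_odds)
recovered_p = population_probability(p_sample, 1.0, 0.10)
```

In a fitted model the same quantity would usually be supplied as an offset term, so that the estimated intercept is already on the population scale.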
We investigate the unified inference of a time-varying additive model under the quantile regression framework, considering both sparse and dense longitudinal or functional data. For convolution-type smoothed objective functions, we propose a two-step method for estimating both the trend and the component functions. Theoretical analysis shows that the two-step estimators share the same asymptotic distribution as the oracle estimators, while the convergence rates and limiting variance functions differ between sparse and dense situations. However, making a subjective choice between these two cases can lead to incorrect statistical inferences. To address this issue, we develop sandwich formulas for variance estimations. This allows us to establish a unified inference without the need to decide whether the data are sparse or dense. Via simulation studies, we assess the finite-sample performance of the proposed methods. Finally, analyses of two different types of real data illustrate our proposed methods.
Unified inference for longitudinal/functional data quantile dynamic additive models. Qian Huang, Tao Li, Jinhong You, Liwen Zhang. Canadian Journal of Statistics-Revue Canadienne De Statistique, volume 53, issue 3. Published 2025-04-11. DOI: 10.1002/cjs.70006 (https://doi.org/10.1002/cjs.70006).
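As an illustration of what a convolution-type smoothed objective looks like in the quantile regression abstract above, the sketch below smooths the usual check loss by convolving it with a Gaussian kernel of bandwidth h, which has a closed form. The choice of a Gaussian kernel and the function names are assumptions made here for illustration; the abstract does not state which kernel or bandwidth the authors use.

```python
import math


def check_loss(u: float, tau: float) -> float:
    """Standard quantile check loss: u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def smoothed_check_loss(u: float, tau: float, h: float) -> float:
    """Check loss convolved with a N(0, h^2) kernel (assumed kernel choice).

    Uses E|u - hZ| for Z ~ N(0, 1), the mean of a folded normal, together with
    the identity check_loss(u) = |u| / 2 + (tau - 1/2) * u.
    """
    a = u / h
    folded_mean = (math.sqrt(2.0 / math.pi) * math.exp(-0.5 * a * a)
                   + a * (1.0 - 2.0 * std_normal_cdf(-a)))
    return 0.5 * h * folded_mean + (tau - 0.5) * u


# The smoothed loss is differentiable at zero and approaches the check loss
# as the bandwidth shrinks relative to the residual.
value_at_zero = smoothed_check_loss(0.0, 0.5, 1.0)
near_check_pos = smoothed_check_loss(3.0, 0.25, 0.01)
near_check_neg = smoothed_check_loss(-3.0, 0.25, 0.01)
```

Because the smoothed loss is convex and twice differentiable, gradient-based solvers can be applied to it directly, which is the usual motivation for this kind of smoothing.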
Balancing data privacy with public access is critical for sensitive datasets. However, even after de-identification, the data are still vulnerable to, for example, inference attacks (by matching some keywords with external datasets). Statistical disclosure control (SDC) methods offer additional protection, and the post-randomization method (PRAM) adds noise to data to achieve this goal. However, PRAM-perturbed data pose challenges for analysis, as directly using the perturbed data leads to biased parameter estimates. This article addresses parameter estimation when data are perturbed using PRAM for privacy. While existing methods suffer from limitations like being parameter-specific, model-dependent and lacking optimality guarantees, our proposed method overcomes these limitations. Our approach applies to general parameters defined through estimating equations and makes no assumptions about the underlying data model. Furthermore, we prove that the proposed estimator achieves the semiparametric efficiency bound, making it asymptotically optimal in terms of estimation efficiency.
Efficient and model-agnostic parameter estimation under privacy-preserving post-randomization data. Qinglong Tian, Jiwei Zhao. Canadian Journal of Statistics-Revue Canadienne De Statistique, volume 53, issue 3. Published 2025-04-07. DOI: 10.1002/cjs.70003 (https://doi.org/10.1002/cjs.70003). Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.70003
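To see why using PRAM-perturbed data directly gives biased estimates, as the post-randomization abstract above notes, consider a binary variable whose value is retained with probability `keep` and flipped otherwise. The sketch below works with expected proportions and applies a simple moment correction that inverts the known perturbation mechanism. This is a textbook illustration under assumed names and an assumed symmetric mechanism, not the efficient estimator proposed in the article.

```python
def expected_perturbed_proportion(true_prop: float, keep: float) -> float:
    """Expected share of ones after PRAM with a symmetric retention probability."""
    return keep * true_prop + (1.0 - keep) * (1.0 - true_prop)


def corrected_proportion(perturbed_prop: float, keep: float) -> float:
    """Invert the perturbation to recover the share of ones before PRAM."""
    if keep == 0.5:
        # With keep = 0.5 the perturbed data carry no information about the truth.
        raise ValueError("the perturbation is not invertible when keep == 0.5")
    return (perturbed_prop - (1.0 - keep)) / (2.0 * keep - 1.0)


true_prop = 0.20   # hypothetical share of ones in the original data
keep = 0.90        # hypothetical retention probability

naive = expected_perturbed_proportion(true_prop, keep)   # what the perturbed data show
corrected = corrected_proportion(naive, keep)            # after undoing the mechanism
```

The naive proportion is pulled toward 0.5, and the size of that pull depends only on the known retention probability, which is what makes a correction possible.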
Jiang, Q., Xia, Y., and Liang, B. (2022) Matching distributions for survival data. The Canadian Journal of Statistics, 50:751–775. The name of the first author “Qiang JIANG” was incorrect. This should have been: “Qing JIANG”. We apologize for this error.
Correction to “Matching distributions for survival data”. Canadian Journal of Statistics-Revue Canadienne De Statistique, volume 53, issue 2. Published 2025-04-04. DOI: 10.1002/cjs.70007 (https://doi.org/10.1002/cjs.70007). Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cjs.70007
It is well known that certain ancillary statistics form a relevant subset, a subset of the sample space on which inference should be restricted, and that conditioning on such ancillary statistics reduces the dimension of the data without a loss of information. The use of ancillary statistics in post-data inference has received significant attention; however, their role in the design of experiments has not been well characterized. Ancillary statistics are not known prior to data collection and as a result cannot be incorporated into the design a priori. Conversely, in sequential experiments the ancillary statistics based on the data from the preceding observations are known and can be used to determine the design assignment of the current observation. The main results of this work describe the benefits of incorporating ancillary statistics, specifically the ancillary statistic that constitutes a relevant subset, into adaptive designs.
Optimal relevant subset designs in nonlinear models. Adam Lane. Canadian Journal of Statistics-Revue Canadienne De Statistique, volume 53, issue 3. Published 2025-04-04. DOI: 10.1002/cjs.70004 (https://doi.org/10.1002/cjs.70004).
Zhuoran Zhang, Olivia Bernstein Morgan, Daniel L. Gillen, for the Alzheimer's Disease Neuroimaging Initiative
Modern epidemiological studies are often characterized by extensive data collection, which facilitates building high-dimensional predictive models. With large samples often conveniently sampled, weighted penalized regression models are commonly applied to provide improved prediction. In this article, we empirically show that weighted ridge regression models may yield suboptimal results because of the lack of flexibility in the penalty structure. We propose a generalized weighted ridge regression (GWRR) estimation procedure that allows for the adjustment of sampling weights in the penalty structure. We derive the asymptotic properties of the proposed GWRR estimator and provide a computationally efficient closed-form solution. We demonstrate the performance of the proposed GWRR estimator and justify the asymptotic variance via simulation studies. Finally, we illustrate the utility of our proposed estimator through an application to the prediction of mini-mental state examination (MMSE) scores.
Reweighted penalized regression for convenience samples. Zhuoran Zhang, Olivia Bernstein Morgan, Daniel L. Gillen, for the Alzheimer's Disease Neuroimaging Initiative. Canadian Journal of Statistics-Revue Canadienne De Statistique, volume 53, issue 3. Published 2025-04-03. DOI: 10.1002/cjs.70005 (https://doi.org/10.1002/cjs.70005).
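For context on the convenience-sample abstract above, the sketch below shows the ordinary weighted ridge estimator in its simplest case: one predictor and no intercept. Sampling weights enter the data-fit term, but the penalty is a single unweighted constant. This is the baseline form that the abstract describes as inflexible, written under assumed names; it is not the GWRR estimator, whose penalty structure is not given in the abstract.

```python
from typing import Sequence


def weighted_ridge_slope(x: Sequence[float],
                         y: Sequence[float],
                         w: Sequence[float],
                         lam: float) -> float:
    """Closed-form weighted ridge slope for a single predictor, no intercept.

    Minimises sum_i w_i * (y_i - b * x_i)^2 + lam * b^2, which gives
    b = sum(w x y) / (sum(w x^2) + lam).
    """
    numerator = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    denominator = sum(wi * xi * xi for wi, xi in zip(w, x)) + lam
    return numerator / denominator


# Hypothetical toy data lying exactly on the line y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w = [1.0, 1.0, 1.0]

unpenalised = weighted_ridge_slope(x, y, w, lam=0.0)   # recovers the slope of 2
shrunk = weighted_ridge_slope(x, y, w, lam=14.0)       # penalty halves the slope
```

With several predictors the same expression becomes a matrix solve, with the weights on the diagonal of a weight matrix and the penalty added to the diagonal of the weighted cross-product matrix.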
In this article, we consider the imputation of missing responses in a longitudinal dataset via matrix completion. We propose a fixed-effect, longitudinal, low-rank model that incorporates both subject-specific and time-specific covariates. To solve the optimization problem, a two-step optimization algorithm is proposed, which provides good statistical properties for the estimation of the fixed effects and the low-rank term. In a theoretical investigation, the non-asymptotic error bounds on the fixed effects and low-rank term are presented. We illustrate the finite-sample performance of the proposed algorithm via simulation studies, and apply our method to a power plant SO