Conformalized survival analysis with adaptive cutoffs
Yu Gui, Rohan Hore, Zhimei Ren, Rina Foygel Barber
Biometrika (impact factor 2.7), 2023-12-01, doi:10.1093/biomet/asad076

Summary: This paper introduces an assumption-lean method that constructs valid and efficient lower predictive bounds (LPBs) for survival times with censored data. We build on recent work by Candès et al. (2021), whose approach first subsets the data to discard any data points with early censoring times, and then uses a reweighting technique, namely weighted conformal inference (Tibshirani et al., 2019), to correct for the distribution shift introduced by this subsetting procedure. For our new method, instead of constraining to a fixed threshold for the censoring time when subsetting the data, we allow a covariate-dependent and data-adaptive subsetting step, which is better able to capture the heterogeneity of the censoring mechanism. As a result, our method can lead to LPBs that are less conservative and give more accurate information. We show that in the Type I right-censoring setting, if either the censoring mechanism or the conditional quantile of the survival time is well estimated, our procedure achieves nearly exact marginal coverage; in the latter case we additionally obtain approximate conditional coverage. We evaluate the validity and efficiency of the proposed algorithm in numerical experiments, illustrating its advantage over competing methods. Finally, our method is applied to a real dataset to generate LPBs for users' active times on a mobile app.
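To make the conformal calibration step concrete, here is a minimal sketch of a plain split-conformal lower predictive bound. It omits censoring and the paper's adaptive, covariate-dependent cutoffs entirely; the crude quantile estimate q(x) = x and the Gaussian data are illustrative assumptions only.

```python
import numpy as np

def conformal_lpb(q_cal, y_cal, q_test, alpha=0.1):
    """Generic split-conformal lower predictive bound (LPB).

    q_cal, q_test: estimated lower quantiles of Y given X on the
    calibration and test points; y_cal: observed calibration outcomes.
    Shifts q so that P(Y >= LPB(X)) >= 1 - alpha marginally.
    """
    scores = q_cal - y_cal                    # positive when q overshoots y
    n = len(scores)
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    adjustment = np.sort(scores)[k - 1]       # conformal (1 - alpha) quantile
    return q_test - adjustment

# Synthetic check: Y = X + N(0, 1), with a deliberately crude estimate q(x) = x.
rng = np.random.default_rng(0)
x_cal, x_test = rng.normal(size=1000), rng.normal(size=5000)
y_cal = x_cal + rng.normal(size=1000)
y_test = x_test + rng.normal(size=5000)
lpb = conformal_lpb(x_cal, y_cal, x_test, alpha=0.1)
coverage = np.mean(y_test >= lpb)             # close to the nominal 0.9
```

Even with a poor quantile estimate, the calibration step restores marginal coverage; what it cannot do, and what motivates methods like the one above, is handle censored outcomes.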
Familial inference: Tests for hypotheses on a family of centres
Ryan Thompson, Catherine S. Forbes, Steven N. MacEachern, Mario Peruggia
Biometrika, 2023-11-28, doi:10.1093/biomet/asad074

Summary: Statistical hypotheses are translations of scientific hypotheses into statements about one or more distributions, often concerning their centre. Tests that assess statistical hypotheses of centre implicitly assume a specific centre, e.g. the mean or the median. Yet scientific hypotheses do not always specify a particular centre. This ambiguity leaves the possibility of a gap between scientific theory and statistical practice that can lead to rejection of a true null. In the face of replicability crises in many scientific disciplines, significant results of this kind are concerning. Rather than testing a single centre, this paper proposes testing a family of plausible centres, such as that induced by the Huber loss function. Each centre in the family generates a testing problem, and the resulting family of hypotheses constitutes a familial hypothesis. A Bayesian nonparametric procedure is devised to test familial hypotheses, enabled by a novel pathwise optimization routine to fit the Huber family. The favourable properties of the new test are demonstrated theoretically and experimentally. Two examples from psychology serve as real-world case studies.
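As background for the Huber family of centres, the sketch below computes the location M-estimate under the Huber loss for a given threshold delta via iteratively reweighted least squares. This is a textbook illustration of how the threshold interpolates between the median and the mean, not the paper's pathwise optimization routine or its Bayesian test.

```python
import numpy as np

def huber_centre(x, delta, tol=1e-10, max_iter=500):
    """Location M-estimate under the Huber loss with threshold delta,
    computed by iteratively reweighted least squares (IRLS)."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                          # robust starting value
    for _ in range(max_iter):
        r = np.abs(x - mu)
        # quadratic region gets weight 1; linear region gets delta / |r|
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])      # one gross outlier
# Large delta recovers the mean; small delta behaves like the median.
centres = [huber_centre(x, d) for d in (0.1, 1.0, 1000.0)]
```

Sweeping delta traces out the family of centres that a familial hypothesis is a statement about: each value of delta yields its own null hypothesis on the corresponding centre.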
Maximum Likelihood Estimation for Semiparametric Regression Models with Interval-Censored Multistate Data
Yu Gu, Donglin Zeng, Gerardo Heiss, D. Y. Lin
Biometrika, 2023-11-24, doi:10.1093/biomet/asad073

Summary: Interval-censored multistate data arise in many studies of chronic diseases, where the health status of a subject can be characterized by a finite number of disease states and the transition between any two states is only known to occur over a broad time interval. We relate potentially time-dependent covariates to multistate processes through semiparametric proportional intensity models with random effects. We study nonparametric maximum likelihood estimation under general interval censoring and develop a stable expectation-maximization algorithm. We show that the resulting parameter estimators are consistent and that the finite-dimensional components are asymptotically normal, with a covariance matrix that attains the semiparametric efficiency bound and can be consistently estimated through profile likelihood. In addition, we demonstrate through extensive simulation studies that the proposed numerical and inferential procedures perform well in realistic settings. Finally, we provide an application to a major epidemiologic cohort study.
On varimax asymptotics in network models and spectral methods for dimensionality reduction
J. Cape
Biometrika, 2023-11-20, doi:10.1093/biomet/asad061

Summary: Varimax factor rotations, while popular among practitioners in psychology and statistics since being introduced by H. Kaiser, have historically been viewed with skepticism and suspicion by some theoreticians and mathematical statisticians. Now, work by K. Rohe and M. Zeng provides new, fundamental insight: varimax rotations provably perform statistical estimation in certain classes of latent variable models when paired with spectral-based matrix truncations for dimensionality reduction. We build on this newfound understanding of varimax rotations by developing further connections to network analysis and spectral methods rooted in entrywise matrix perturbation analysis. Concretely, this paper establishes the asymptotic multivariate normality of vectors in varimax-transformed Euclidean point clouds that represent low-dimensional node embeddings in certain latent space random graph models. We address related concepts including network sparsity, data denoising, and the role of matrix rank in latent variable parameterizations. Collectively, these findings, at the confluence of classical and contemporary multivariate analysis, reinforce methodology and inference procedures grounded in matrix factorization-based techniques. Numerical examples illustrate our findings and supplement our discussion.
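For readers unfamiliar with the rotation being analysed, here is Kaiser's classical varimax iteration in its standard SVD-based fixed-point form. This is generic background on the transformation itself, not the paper's asymptotic machinery.

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-8):
    """Kaiser's varimax rotation of an n x k loading matrix L,
    via the standard SVD-based fixed-point iteration.
    Returns the rotated loadings and the orthogonal rotation R."""
    n, k = L.shape
    R = np.eye(k)
    obj_old = 0.0
    for _ in range(max_iter):
        LR = L @ R
        # Gradient-like term of the varimax criterion at the current rotation
        G = L.T @ (LR ** 3 - (gamma / n) * LR * (LR ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(G)
        R = u @ vt                       # nearest orthogonal matrix to G
        obj = s.sum()
        if obj - obj_old < tol:
            break
        obj_old = obj
    return L @ R, R

rng = np.random.default_rng(2)
L = rng.normal(size=(50, 3))             # synthetic loading matrix
L_rot, R = varimax(L)
```

Because R is orthogonal, the rotation changes only how variance is distributed across columns (encouraging loadings near zero or large in magnitude), never the column space or the overall scale of the embedding.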
Second term improvement to generalised linear mixed model asymptotics
Luca Maestrini, Aishwarya Bhaskaran, Matt P. Wand
Biometrika, 2023-11-16, doi:10.1093/biomet/asad072

Summary: A recent article on generalised linear mixed model asymptotics, Jiang et al. (2022), derived the rates of convergence for the asymptotic variances of maximum likelihood estimators. If m denotes the number of groups and n the average within-group sample size, then the asymptotic variances have orders m^{-1} and (mn)^{-1}, depending on the parameter. We extend this theory to provide explicit forms of the (mn)^{-1} second terms for the asymptotically harder-to-estimate parameters. Improved accuracy of statistical inference and planning are consequences of our theory.
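The two convergence regimes described in the abstract can be written schematically as follows; the coefficients c and d are hypothetical placeholders for the leading constants, not quantities taken from the paper.

```latex
% m groups, average within-group sample size n.
% Harder-to-estimate parameters: order m^{-1}, with an explicit (mn)^{-1} second term.
\operatorname{Var}(\hat\theta) =
\begin{cases}
  \dfrac{c_\theta}{m} + \dfrac{d_\theta}{mn} + o\{(mn)^{-1}\}, & \text{harder-to-estimate parameters},\\[1.5ex]
  \dfrac{c_\theta}{mn} + o\{(mn)^{-1}\}, & \text{remaining parameters}.
\end{cases}
```

The contribution of the paper, in this notation, is supplying the explicit d-type terms in the first case, which the earlier rate results left unspecified.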
Projective Independence Tests in High Dimensions: the Curses and the Cures
Yaowu Zhang, Liping Zhu
Biometrika, 2023-11-15, doi:10.1093/biomet/asad070

Summary: Testing independence between high-dimensional random vectors is fundamentally different from testing independence between univariate random variables. Take the projection correlation as an example. It suffers from at least three issues. First, it has a high computational complexity of O{n^3(p + q)}, where n, p and q are the sample size and the dimensions of the two random vectors, respectively. This limits its usefulness substantially when n is extremely large. Second, the asymptotic null distribution of the projection correlation test is rarely tractable, so random permutations are often used to approximate it; this further increases the cost of implementing independence tests. Last, the power of the projection correlation test deteriorates in high dimensions. To address these issues, we improve the projection correlation through a modified weight function, which reduces the complexity to O{n^2(p + q)}. We estimate the improved projection correlation with U-statistic theory. More importantly, its asymptotic null distribution is standard normal, thanks to the high dimensions of the random vectors. This expedites the implementation of independence tests substantially. To enhance power in high dimensions, we introduce a cross-validation procedure that incorporates feature screening into the projection correlation test. The implementation efficacy and power enhancement are confirmed through extensive numerical studies.
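The abstract notes that random permutations are often used to approximate an intractable null distribution. Below is a minimal generic version of that scheme, with the absolute Pearson correlation as a hypothetical stand-in statistic; the projection correlation itself is not implemented here.

```python
import numpy as np

def permutation_pvalue(stat, X, Y, n_perm=200, seed=0):
    """Approximate the null distribution of an independence statistic
    stat(X, Y) by permuting the rows of Y, which breaks any dependence
    between the paired samples while preserving the marginals."""
    rng = np.random.default_rng(seed)
    t_obs = stat(X, Y)
    t_null = np.array([stat(X, Y[rng.permutation(len(Y))])
                       for _ in range(n_perm)])
    # The +1 correction keeps the p-value valid in finite samples.
    return (1 + np.sum(t_null >= t_obs)) / (n_perm + 1)

abs_corr = lambda x, y: abs(np.corrcoef(x, y)[0, 1])   # stand-in statistic
rng = np.random.default_rng(1)
x = rng.normal(size=100)
p_dep = permutation_pvalue(abs_corr, x, x + 0.1 * rng.normal(size=100))
p_indep = permutation_pvalue(abs_corr, x, rng.normal(size=100))
```

Each call to stat costs the full statistic's complexity, so with n_perm permutations the total cost is n_perm times that of a single evaluation. That multiplicative overhead is exactly what a tractable standard-normal null, as derived in the paper, removes.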
Discussion of 'Statistical inference for streamed longitudinal data'
J. Wang, H. Wang, K. Chen
Biometrika, 2023-11-15, doi:10.1093/biomet/asad035
Generalized kernel two-sample tests
Hoseung Song, Hao Chen
Biometrika, 2023-11-14, doi:10.1093/biomet/asad068

Summary: Kernel two-sample tests have been widely used for multivariate data to test equality of distributions. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space mainly target specific alternatives and do not work well in some scenarios when the dimension of the data is moderate to high, owing to the curse of dimensionality. We propose a new test statistic that makes use of a common pattern under moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power with low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared with other state-of-the-art tests under various settings and show good performance. We showcase the new approaches through two applications: the comparison of musks and non-musks using the shape of molecules, and the comparison of taxi trips starting from John F. Kennedy airport in consecutive months. All proposed methods are implemented in the R package kerTests.
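For orientation, the classical statistic that this line of work generalizes is the unbiased estimate of the squared maximum mean discrepancy (MMD) with a Gaussian kernel. The sketch below is that standard baseline, assuming a fixed bandwidth; it is not the paper's generalized statistics or the kerTests implementation.

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between samples X and Y
    (rows are observations) under a Gaussian kernel."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    # Exclude diagonal terms for unbiasedness of the within-sample means.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = rng.normal(size=(200, 3))               # same distribution as X
Z = rng.normal(size=(200, 3)) + [2, 0, 0]   # mean shift in one coordinate
m_same = mmd2_unbiased(X, Y)                # fluctuates around zero
m_diff = mmd2_unbiased(X, Z)                # inflated by the shift
```

Under the null the statistic hovers around zero, and its power against a given alternative hinges on the kernel and bandwidth; the dimension-sensitivity of that baseline is the behaviour the paper's new statistic is designed to improve.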