Summary Testing independence between high dimensional random vectors is fundamentally different from testing independence between univariate random variables. Take the projection correlation as an example. It suffers from at least three issues. First, it has a high computational complexity of O{n³(p + q)}, where n, p and q are the respective sample size and dimensions of the random vectors. This limits its usefulness substantially when n is extremely large. Second, the asymptotic null distribution of the projection correlation test is rarely tractable. Therefore, random permutations are often suggested to approximate the asymptotic null distribution. This further increases the complexity of implementing independence tests. Last, the power performance of the projection correlation test deteriorates in high dimensions. To address these issues, we improve the projection correlation through a modified weight function, which reduces the complexity to O{n²(p + q)}. We estimate the improved projection correlation with U-statistic theory. More importantly, its asymptotic null distribution is standard normal, thanks to the high dimensions of random vectors. This expedites the implementation of independence tests substantially. To enhance power performance in high dimensions, we introduce a cross-validation procedure which incorporates feature screening with the projection correlation test. The implementation efficacy and power enhancement are confirmed through extensive numerical studies.
{"title":"Projective Independence Tests in High Dimensions: the Curses and the Cures","authors":"Yaowu Zhang, Liping Zhu","doi":"10.1093/biomet/asad070","DOIUrl":"https://doi.org/10.1093/biomet/asad070","url":null,"abstract":"Summary Testing independence between high dimensional random vectors is fundamentally different from testing independence between univariate random variables. Take the projection correlation as an example. It suffers from at least three issues. First, it has a high computational complexity of O{n3 (p + q)}, where n, p and q are the respective sample size and dimensions of the random vectors. This limits its usefulness substantially when n is extremely large. Second, the asymptotic null distribution of the projection correlation test is rarely tractable. Therefore, random permutations are often suggested to approximate the asymptotic null distribution. This further increases the complexity of implementing independence tests. Last, the power performance of the projection correlation test deteriorates in high dimensions. To address these issues, we improve the projection correlation through a modified weight function, which reduces the complexity to O{n2 (p + q)}. We estimate the improved projection correlation with U-statistic theory. More importantly, its asymptotic null distribution is standard normal, thanks to the high dimensions of random vectors. This expedites the implementation of independence tests substantially. To enhance power performance in high dimensions, we introduce a cross-validation procedure which incorporates feature screening with the projection correlation test. The implementation efficacy and power enhancement are confirmed through extensive numerical studies.","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":"11 6","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138508102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discussion of ‘Statistical inference for streamed longitudinal data’","authors":"J. Wang, H. Wang, K. Chen","doi":"10.1093/biomet/asad035","DOIUrl":"https://doi.org/10.1093/biomet/asad035","url":null,"abstract":"","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":"4 4","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139274543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary Kernel two-sample tests have been widely used for multivariate data to test equality of distributions. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space mainly target specific alternatives and do not work well in some scenarios when the dimension of the data is moderate to high, owing to the curse of dimensionality. We propose a new test statistic that makes use of a common pattern under moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power with low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared to other state-of-the-art tests under various settings and show good performance. We showcase the new approaches through two applications: the comparison of musks and non-musks using the shape of molecules, and the comparison of taxi trips starting from John F. Kennedy airport in consecutive months. All proposed methods are implemented in the R package kerTests.
{"title":"Generalized kernel two-sample tests","authors":"Hoseung Song, Hao Chen","doi":"10.1093/biomet/asad068","DOIUrl":"https://doi.org/10.1093/biomet/asad068","url":null,"abstract":"Summary Kernel two-sample tests have been widely used for multivariate data to test equality of distributions. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space mainly target specific alternatives and do not work well for some scenarios when the dimension of the data is moderate to high due to the curse of dimensionality. We propose a new test statistic that makes use of a common pattern under moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power with low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared to other state-of-the-art tests under various settings and show good performance. We showcase the new approaches through two applications: the comparison of musks and non-musks using the shape of molecules, and the comparison of taxi trips starting from John F. Kennedy airport in consecutive months. All proposed methods are implemented in an R package kerTests.","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":"114 19","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134957329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary We propose a novel method for testing serial independence of object-valued time series in metric spaces, which is more general than Euclidean or Hilbert spaces. The proposed method is fully nonparametric, free of tuning parameters and can capture all nonlinear pairwise dependence. The key concept used in this paper is the distance covariance in metric spaces, which is extended to auto-distance covariance for object-valued time series. Furthermore, we propose a generalized spectral density function to account for pairwise dependence at all lags and construct a Cramér–von Mises-type test statistic. New theoretical arguments are developed to establish the asymptotic behaviour of the test statistic. A wild bootstrap is also introduced to obtain the critical values of the nonpivotal limiting null distribution. Extensive numerical simulations and two real data applications on cumulative intraday returns and human mortality data are conducted to illustrate the effectiveness and versatility of our proposed test.
{"title":"Testing Serial Independence of Object-Valued Time Series","authors":"Feiyu Jiang, Hanjia Gao, Xiaofeng Shao","doi":"10.1093/biomet/asad069","DOIUrl":"https://doi.org/10.1093/biomet/asad069","url":null,"abstract":"Summary We propose a novel method for testing serial independence of object-valued time series in metric spaces, which is more general than Euclidean or Hilbert spaces. The proposed method is fully nonparametric, free of tuning parameters and can capture all nonlinear pairwise dependence. The key concept used in this paper is the distance covariance in metric spaces, which is extended to auto-distance covariance for object-valued time series. Furthermore, we propose a generalized spectral density function to account for pairwise dependence at all lags and construct a Cramér von-Mises type test statistic. New theoretical arguments are developed to establish the asymptotic behaviour of the test statistic. A wild bootstrap is also introduced to obtain the critical values of the nonpivotal limiting null distribution. Extensive numerical simulations and two real data applications on cumulative intraday returns and human mortality data are conducted to illustrate the effectiveness and versatility of our proposed test.","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":"6 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135087094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary Score-driven models have recently been introduced as a general framework to specify time-varying parameters of conditional densities. The underlying idea is to specify a time-varying parameter as an autoregressive process with innovation given by the score of the associated log-likelihood. The score enjoys stochastic properties that make these models easy to implement and convenient to apply in several contexts, ranging from biostatistics to finance. Score-driven parameter updates have been shown to be optimal in terms of locally reducing a local version of the Kullback–Leibler divergence between the true conditional density and the postulated density of the model. A key limitation of such an optimality property is that it holds only locally, both in the parameter space and in the sample space, yielding a definition of local Kullback–Leibler divergence that is in fact not a divergence measure. The current paper shows that score-driven updates satisfy stronger optimality properties that are based on a global definition of Kullback–Leibler divergence. In particular, it is shown that score-driven updates reduce the distance between the expected updated parameter and the pseudo-true parameter. Furthermore, depending on the conditional density and the scaling of the score, the optimality result can hold globally over the parameter space, which can be viewed as a generalization of the monotonicity property of the stochastic gradient descent scheme. Several examples illustrate how the results derived in the paper apply to specific models under different easy-to-check assumptions, and provide a formal method to select the link function and the scaling of the score.
{"title":"On the optimality of score-driven models","authors":"P Gorgi, C S A Lauria, A Luati","doi":"10.1093/biomet/asad067","DOIUrl":"https://doi.org/10.1093/biomet/asad067","url":null,"abstract":"Summary Score-driven models have been recently introduced as a general framework to specify time-varying parameters of conditional densities. %The underlying idea is to specify a time-varying parameter as an autoregressive process with innovation given by the score of the associated log-likelihood. The score enjoys stochastic properties that make these models easy to implement and convenient to apply in several contexts, ranging from biostatistics to finance. Score-driven parameter updates have been shown to be optimal in terms of locally reducing a local version of the Kullback–Leibler divergence between the true conditional density and the postulated density of the model. A key limitation of such an optimality property is that it holds only locally both in the parameter space and sample space, yielding to a definition of local Kullback–Leibler divergence that is in fact not a divergence measure. The current paper shows that score-driven updates satisfy stronger optimality properties that are based on a global definition of Kullback–Leibler divergence. In particular, it is shown that score-driven updates reduce the distance between the expected updated parameter and the pseudo-true parameter. Furthermore, depending on the conditional density and the scaling of the score, the optimality result can hold globally over the parameter space, which can be viewed as a generalization of the monotonicity property of the stochastic gradient descent scheme. Several examples illustrate how the results derived in the paper apply to specific models under different easy-to-check assumptions, and provide a formal method to select the link-function and the scaling of the score.","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":" 28","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135291655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary We consider the efficient estimation of total causal effects in the presence of unmeasured confounding using conditional instrumental sets. Specifically, we consider the two-stage least squares estimator in the setting of a linear structural equation model with correlated errors that is compatible with a known acyclic directed mixed graph. To set the stage for our results, we characterize the class of linearly valid conditional instrumental sets that yield consistent two-stage least squares estimators for the target total effect and derive a new asymptotic variance formula for these estimators. Equipped with these results, we provide three graphical tools for selecting more efficient linearly valid conditional instrumental sets. First, a graphical criterion that for certain pairs of linearly valid conditional instrumental sets identifies which of the two corresponding estimators has the smaller asymptotic variance. Second, an algorithm that greedily adds covariates that reduce the asymptotic variance to a given linearly valid conditional instrumental set. Third, a linearly valid conditional instrumental set for which the corresponding estimator has the smallest asymptotic variance that can be ensured with a graphical criterion.
{"title":"Graphical tools for selecting conditional instrumental sets","authors":"Henckel, Leonard, Buttenschön, Martin, Maathuis, Marloes H.","doi":"10.1093/biomet/asad066","DOIUrl":"https://doi.org/10.1093/biomet/asad066","url":null,"abstract":"Summary We consider the efficient estimation of total causal effects in the presence of unmeasured confounding using conditional instrumental sets. Specifically, we consider the two-stage least squares estimator in the setting of a linear structural equation model with correlated errors that is compatible with a known acyclic directed mixed graph. To set the stage for our results, we characterize the class of linearly valid conditional instrumental sets that yield consistent two-stage least squares estimators for the target total effect and derive a new asymptotic variance formula for these estimators. Equipped with these results, we provide three graphical tools for selecting more efficient linearly valid conditional instrumental sets. First, a graphical criterion that for certain pairs of linearly valid conditional instrumental sets identifies which of the two corresponding estimators has the smaller asymptotic variance. Second, an algorithm that greedily adds covariates that reduce the asymptotic variance to a given linearly valid conditional instrumental set. Third, a linearly valid conditional instrumental set for which the corresponding estimator has the smallest asymptotic variance that can be ensured with a graphical criterion.","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":"217 S694","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135775010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract Direct use of the likelihood function typically produces severely biased estimates when the dimension of the parameter vector is large relative to the effective sample size. With linearly separable data generated from a logistic regression model, the loglikelihood function asymptotes and the maximum likelihood estimator does not exist. We show that an exact analysis for each regression coefficient produces half-infinite confidence sets for some parameters when the data are separable. Such conclusions are not vacuous, but an honest portrayal of the limitations of the data. Finite confidence sets are only achievable when additional, perhaps implicit, assumptions are made. Under a notional double-asymptotic regime in which the dimension of the logistic coefficient vector increases with the sample size, the present paper considers the implications of enforcing a natural constraint on the vector of logistic-transformed probabilities. We derive a relationship between the logistic coefficients and a notional parameter obtained as a probability limit of an ordinary least squares estimator. The latter exists even when the data are separable. Consistency is ascertained under weak conditions on the design matrix.
{"title":"On inference in high-dimensional logistic regression models with separated data","authors":"R M Lewis, H S Battey","doi":"10.1093/biomet/asad065","DOIUrl":"https://doi.org/10.1093/biomet/asad065","url":null,"abstract":"Abstract Direct use of the likelihood function typically produces severely biased estimates when the dimension of the parameter vector is large relative to the effective sample size. With linearly separable data generated from a logistic regression model, the loglikelihood function asymptotes and the maximum likelihood estimator does not exist. We show that an exact analysis for each regression coefficient produces half-infinite confidence sets for some parameters when the data are separable. Such conclusions are not vacuous, but an honest portrayal of the limitations of the data. Finite confidence sets are only achievable when additional, perhaps implicit, assumptions are made. Under a notional double-asymptotic regime in which the dimension of the logistic coefficient vector increases with the sample size, the present paper considers the implications of enforcing a natural constraint on the vector of logistic-transformed probabilities. We derive a relationship between the logistic coefficients and a notional parameter obtained as a probability limit of an ordinary least squares estimator. The latter exists even when the data are separable. Consistency is ascertained under weak conditions on the design matrix.","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135975796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary Modelling of the dependence structure across heterogeneous data is crucial for Bayesian inference since it directly impacts the borrowing of information. Despite the extensive advances over the last two decades, most available proposals only allow for nonnegative correlations. We derive a new class of dependent nonparametric priors that can induce correlations of any sign, thus introducing a new and more flexible idea of borrowing of information. This is achieved through a novel concept, which we term a hyper-tie and which represents a direct and simple measure of dependence. We investigate prior and posterior distributional properties of the model and develop algorithms to perform posterior inference. Illustrative examples on simulated and real data show that our proposal outperforms alternatives in terms of prediction and clustering.
{"title":"Nonparametric priors with full-range borrowing of information","authors":"F Ascolani, B Franzolini, A Lijoi, I Prünster","doi":"10.1093/biomet/asad063","DOIUrl":"https://doi.org/10.1093/biomet/asad063","url":null,"abstract":"Summary Modelling of the dependence structure across heterogeneous data is crucial for Bayesian inference since it directly impacts the borrowing of information. Despite the extensive advances over the last two decades, most available proposals only allow for nonnegative correlations. We derive a new class of dependent nonparametric priors that can induce correlations of any sign, thus introducing a new and more flexible idea of borrowing of information. This is achieved thanks to a novel concept, which we term hyper-tie, and represents a direct and simple measure of dependence. We investigate prior and posterior distributional properties of the model and develop algorithms to perform posterior inference. Illustrative examples on simulated and real data show that our proposal outperforms alternatives in terms of prediction and clustering.","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135729762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary Likelihood-based inference under nonconvex constraints on model parameters has become increasingly common in biomedical research. In this paper, we establish large-sample properties of the maximum likelihood estimator when the true parameter value lies at the boundary of a nonconvex parameter space. We further derive the asymptotic distribution of the likelihood ratio test statistic under nonconvex constraints on model parameters. A general Monte Carlo procedure for generating the limiting distribution is provided. The theoretical results are demonstrated by five examples in Anderson’s stereotype logistic regression model, genetic association studies, gene-environment interaction tests, cost-constrained linear regression and fairness-constrained linear regression.
{"title":"Likelihood-based Inference under Non-Convex Boundary Constraints","authors":"J Y Wang, Z S YE, Y Chen","doi":"10.1093/biomet/asad062","DOIUrl":"https://doi.org/10.1093/biomet/asad062","url":null,"abstract":"Summary Likelihood-based inference under nonconvex constraints on model parameters has become increasingly common in biomedical research. In this paper, we establish large-sample properties of the maximum likelihood estimator when the true parameter value lies at the boundary of a nonconvex parameter space. We further derive the asymptotic distribution of the likelihood ratio test statistic under nonconvex constraints on model parameters. A general Monte Carlo procedure for generating the limiting distribution is provided. The theoretical results are demonstrated by five examples in Anderson’s stereotype logistic regression model, genetic association studies, gene-environment interaction tests, cost-constrained linear regression and fairness-constrained linear regression.","PeriodicalId":9001,"journal":{"name":"Biometrika","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135729962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}