"Studies in the history of probability and statistics, LI: the first conditional logistic regression" by J A Hanley. Biometrika, doi:10.1093/biomet/asae038, published 2024-08-09.

Statisticians and epidemiologists generally cite the 1978 publications by Prentice & Breslow and by Breslow et al. as the first description and use of conditional logistic regression, while economists cite the 1973 book chapter by Nobel laureate McFadden. We describe a previously unrecognized use, and way of fitting, of this model by Lionel Penrose and Ronald Fisher in 1934.
"Robust Covariate-Balancing Method in Learning Optimal Individualized Treatment Regimes" by Canhui Li, Donglin Zeng, Wensheng Zhu. Biometrika, doi:10.1093/biomet/asae036, published 2024-07-17.

One of the most important problems in precision medicine is finding the optimal individualized treatment rule, which recommends treatment decisions based on patients' individual characteristics so as to maximize overall clinical benefit. Typically the expected clinical outcome must be estimated first, and most existing statistical methods assume either an outcome regression model or a propensity score model for this step. If the assumed model is invalid, however, the estimated treatment regime is unreliable. In this article we first define a contrast value function, which forms the basis of our study of individualized treatment regimes. We then construct a hybrid estimator of the contrast value function by combining two types of estimation methods. We further propose a robust covariate-balancing estimator of the contrast value function that combines inverse probability weighting with matching, building on the covariate balancing propensity score of Imai and Ratkovic (2014). Theoretical results show that the proposed estimator is doubly robust: it is consistent if either the propensity score model or the matching is correct. Extensive simulation studies demonstrate that the proposed estimator outperforms existing methods. Lastly, the method is illustrated with an analysis of the SUPPORT study.
"Causal inference with hidden mediators" by AmirEmad Ghassami, Alan Yang, Ilya Shpitser, Eric Tchetgen Tchetgen. Biometrika, doi:10.1093/biomet/asae037, published 2024-07-13.

Proximal causal inference was recently proposed as a framework for identifying causal effects from observational data in the presence of hidden confounders for which proxies are available. In this paper we extend the proximal causal inference approach to settings where identification of causal effects hinges on a set of mediators that are not observed, but for which error-prone proxies are measured. Specifically, (i) we establish causal hidden mediation analysis, which extends classical causal mediation analysis for identifying natural direct and indirect effects under no unmeasured confounding to a setting where the mediator of interest is hidden but proxies of it are available; (ii) we establish a hidden front-door criterion, which extends the classical front-door criterion to allow for hidden mediators with available proxies; and (iii) we show that a certain causal effect, the population intervention indirect effect, remains identifiable with hidden mediators in settings where the challenges in (i) and (ii) may co-exist. We view (i)-(iii) as important steps towards the practical application of front-door criteria and mediation analysis, since mediators are almost always measured with error, and thus the most one can hope for in practice is that the measurements are proxies of the mediating mechanisms. We propose identification approaches for the parameters of interest in the models considered. For estimation, we propose an influence-function-based method and analyse the robustness of the resulting estimators.
"More Power by Using Fewer Permutations" by Nick W Koning. Biometrika, doi:10.1093/biomet/asae031, published 2024-07-10.

It is conventionally believed that permutation-based testing methods should ideally use all permutations. We challenge this by showing that we can sometimes obtain dramatically more power by using a tiny subgroup. As the subgroup is tiny, this also comes at a much lower computational cost. Moreover, the method remains valid for the same hypotheses. We exploit this to improve the popular permutation-based Westfall & Young MaxT multiple testing method. We analyze the relative efficiency in a Gaussian location model and find the largest gain in high dimensions.
"Testing Independence for Sparse Longitudinal Data" by Changbo Zhu, Junwen Yao, Jane-Ling Wang. Biometrika, doi:10.1093/biomet/asae035, published 2024-07-08.

With the advance of science and technology, more and more data are collected in the form of functions. A fundamental question for a pair of random functions is whether they are independent. This problem becomes quite challenging when the random trajectories are sampled irregularly and sparsely for each subject; that is, each random function is observed at only a few time-points, and these time-points vary across subjects. Furthermore, the observed data may contain noise. To the best of our knowledge, the literature contains no consistent test of independence for sparsely observed functional data. We show in this work that testing pointwise independence simultaneously is feasible. The test statistics are constructed by integrating pointwise distance covariances (Székely et al., 2007) and are shown to converge, at a certain rate, to their population counterparts, which characterize the simultaneous pointwise independence of two random functions. The performance of the proposed methods is further verified by Monte Carlo simulations and an analysis of real data.
"Semiparametric efficiency gains from parametric restrictions on propensity scores" by Haruki Kono. Biometrika, doi:10.1093/biomet/asae034, published 2024-07-06.

We explore how much knowing a parametric restriction on propensity scores improves semiparametric efficiency bounds in the potential outcome framework. For stratified propensity scores, considered as a parametric model, we derive explicit formulas for the efficiency gain from knowing how the covariate space is split. Based on these, we find that the efficiency gain decreases as the partition of the stratification becomes finer. For general parametric models, where it is hard to obtain explicit representations of efficiency bounds, we propose a novel framework that enables us to see whether knowing a parametric model is valuable in terms of efficiency even when it is high-dimensional. In addition to the intuitive fact that knowing the parametric model does not help much if it is sufficiently flexible, we discover that the efficiency gain can be nearly zero even though the parametric assumption significantly restricts the space of possible propensity scores.
"Debiasing Welch's Method for Spectral Density Estimation" by Lachlan C Astfalck, Adam M Sykulski, Edward J Cripps. Biometrika, doi:10.1093/biomet/asae033, published 2024-07-01.

Welch's method provides a statistically consistent estimator of the power spectral density, obtained by averaging periodograms computed from overlapping segments of a time series. For a time series of finite length, the variance of the estimator decreases as the number of segments increases, but the magnitude of its bias grows: a bias-variance trade-off ensues when setting the number of segments. We address this issue with a novel method for debiasing Welch's method that maintains its computational complexity and asymptotic consistency while improving finite-sample performance. Theoretical results are given for fourth-order stationary processes with finite fourth-order moments and an absolutely convergent fourth-order cumulant function. The significant bias reduction is demonstrated with numerical simulation and an application to real-world data. Our estimator also permits irregular spacing over frequency, and we demonstrate how this may be employed for signal compression and further variance reduction. Code accompanying this work is available in R and Python.
"Testing serial dependence or cross dependence for time series with underreporting" by Keyao Wei, Lengyang Wang, Yingcun Xia. Biometrika, doi:10.1093/biomet/asae027, published 2024-06-22.

In practice it is common for collected data to be underreported, a problem particularly prevalent in fields such as the social sciences, ecology and epidemiology. Drawing inferences from such data using conventional statistical methods can lead to incorrect conclusions. In this paper we study tests for serial or cross dependence in time series data that are subject to underreporting. We introduce new test statistics, develop corresponding group-of-blocks bootstrap techniques, and establish their consistency. The methods are shown by simulation to be efficient, and are used to identify key factors responsible for the spread of dengue fever and the occurrence of cardiovascular disease.
"A Rank-Based Sequential Test of Independence" by Alexander Henzi, Michael Law. Biometrika, doi:10.1093/biomet/asae023, published 2024-05-13.

We consider the problem of independence testing for two univariate random variables in a sequential setting. Leveraging recent developments in safe, anytime-valid inference, we propose a test with time-uniform type I error control and derive explicit bounds on its finite-sample performance. We demonstrate the empirical performance of the procedure in comparison with existing sequential and non-sequential independence tests. Furthermore, since the proposed test is distribution-free under the null hypothesis, we empirically simulate the gap due to Ville's inequality, the supermartingale analogue of Markov's inequality, which is commonly applied to control type I error in anytime-valid inference, and use this to construct a truncated sequential test.
"A model-free variable screening method for optimal treatment regimes with high-dimensional survival data" by Cheng-Han Yang, Yu-Jen Cheng. Biometrika, doi:10.1093/biomet/asae022, published 2024-05-05.

We propose a model-free variable screening method for the optimal treatment regime with high-dimensional survival data. The proposed screening method provides a unified framework for selecting the active variables in a prespecified target population, with the treated group as a special case. In this framework, the optimal treatment regime is exactly the optimal classifier minimizing a weighted misclassification error rate, with weights determined by the survival outcome, the censoring distribution, and the prespecified target population. Our main contribution is to reformulate this weighted classification problem as a classification problem in a hypothetical population, in which the observed data can be viewed as a sample obtained by outcome-dependent sampling with selection probability inversely proportional to the weights. Consequently, we introduce a weighted Kolmogorov–Smirnov approach for selecting the active variables in the optimal treatment regime, extending the conventional Kolmogorov–Smirnov method for binary classification. The proposed screening method also exhibits two levels of robustness: first, it requires no model assumptions for the survival outcome given treatment and covariates; second, the form of the treatment regime may be left unspecified, without requiring a convex surrogate loss such as the logit or hinge loss. As a result, the proposed screening method is robust to model misspecification, and nonparametric learning methods such as random forests and boosting can be applied to the selected variables for further analysis. We establish the theoretical properties of the proposed method, examine its performance through simulation studies, and illustrate it with a real dataset.