Probability surveys are challenged by increasing nonresponse rates, resulting in biased statistical inference. Auxiliary information about populations can be used to reduce bias in estimation. Often, continuous auxiliary variables in administrative records are discretized before being released to the public to avoid confidentiality breaches. This may weaken the utility of the administrative records in improving survey estimates, particularly when there is a strong relationship between the continuous auxiliary information and the survey outcome. In this paper, we propose a two-step strategy, in which statistical agencies first use the confidential continuous auxiliary data in the population to estimate the response propensity score of the survey sample, which is then included in a modified population dataset for data users. In the second step, data users who do not have access to the confidential continuous auxiliary data conduct predictive survey inference by including the discretized continuous variables and the propensity score as predictors, using splines in a Bayesian model. We show by simulation that the proposed method performs well, yielding more efficient estimates of population means, with 95% credible intervals providing better coverage than alternative approaches. We illustrate the proposed method using the Ohio Army National Guard Mental Health Initiative (OHARNG-MHI). The methods developed in this work are readily available in the R package AuxSurvey.
Improving Survey Inference Using Administrative Records Without Releasing Individual-Level Continuous Data.
Sharifa Z Williams, Jungang Zou, Yutao Liu, Yajuan Si, Sandro Galea, Qixuan Chen
Statistics in Medicine, pp. 5803-5813. DOI: 10.1002/sim.10270
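As a rough sketch of the two-step release described above (the agency fits a response propensity model on the confidential continuous covariate and releases only a discretized covariate plus the propensity score), consider the following toy example. The logistic model, its coefficients, and the cutpoints are all invented for illustration; this is not the AuxSurvey implementation.

```python
import math
import random

def propensity(x, alpha=-1.0, beta=0.8):
    # Hypothetical logistic response-propensity model fit by the agency
    # on the confidential continuous covariate x (coefficients made up).
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def release(x, cutpoints=(-0.5, 0.0, 0.5)):
    # Step 1 (agency side): release only the discretized covariate and the
    # estimated propensity score; the raw continuous x never leaves the agency.
    return {"x_cat": sum(x > c for c in cutpoints), "pscore": propensity(x)}

random.seed(1)
population = [random.gauss(0.0, 1.0) for _ in range(5)]
public_data = [release(x) for x in population]  # what data users would see
```

In step 2, a data user would regress the survey outcome on `x_cat` and a spline in `pscore`; that modeling step is omitted here.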
Pub Date: 2024-12-30. Epub Date: 2024-12-01. DOI: 10.1002/sim.10296
Lili Yu, Liang Liu
Independent censoring is usually assumed in survival data analysis. However, dependent censoring, in which the survival time depends on the censoring time, is often seen in real data applications. In this project, we model the vector of survival and censoring times marginally through semiparametric heteroscedastic accelerated failure time models and model their association through the vector of errors in the models. We show that this semiparametric model is identified, and the generalized estimating equation approach is extended to estimate the parameters in this model. The estimators of the model parameters are shown to be consistent and asymptotically normal. Simulation studies are conducted to compare the method with estimation under a parametric model. A real dataset from a prostate cancer study is used to illustrate the proposed method.
Generalized Estimating Equations for Survival Data With Dependent Censoring.
Statistics in Medicine, pp. 5983-5995. DOI: 10.1002/sim.10296
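In the spirit of the abstract, one plausible way to write the marginal heteroscedastic accelerated failure time models is the following display (a reader's sketch of standard AFT notation, not copied from the paper):

```latex
\log T_i = X_i^{\top}\beta_T + \sigma_T(X_i)\,\varepsilon_{Ti}, \qquad
\log C_i = X_i^{\top}\beta_C + \sigma_C(X_i)\,\varepsilon_{Ci},
```

with dependence between the survival time $T_i$ and the censoring time $C_i$ induced by the joint distribution of the error vector $(\varepsilon_{Ti}, \varepsilon_{Ci})$.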
Pub Date: 2024-12-30. Epub Date: 2024-11-20. DOI: 10.1002/sim.10272
Xiaogang Su, Lei Liu, Lili Liu, Ruiwen Zhou, Guoqiao Wang, Elise Dusseldorp, Tianni Zhou
We propose a novel regression tree method named "TreeFuL," an abbreviation for "Tree with Fused Leaves." TreeFuL innovatively combines recursive partitioning with fused regularization, offering a distinct approach to the conventional pruning method. One of TreeFuL's noteworthy advantages is its capacity for cross-validated amalgamation of non-neighboring terminal nodes. This is facilitated by a leaf coloring scheme that supports tree shearing and node amalgamation. As a result, TreeFuL facilitates the development of more parsimonious tree models without compromising predictive accuracy. The refined model offers enhanced interpretability, making it particularly well-suited for biomedical applications of decision trees, such as disease diagnosis and prognosis. We demonstrate the practical advantages of our proposed method through simulation studies and an analysis of data collected in an obesity study.
Regression Trees With Fused Leaves.
Statistics in Medicine, pp. 5872-5884. DOI: 10.1002/sim.10272
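The amalgamation idea (merging terminal nodes with similar fitted values even when they are not neighbors in the tree) can be caricatured in a few lines. This is a toy greedy merge on leaf means, not the authors' fused-regularization algorithm:

```python
def amalgamate(leaf_means, tol=0.5):
    """Greedily group leaves (neighboring or not) whose fitted means are
    within tol of the previous leaf in mean-sorted order."""
    groups = []
    for leaf, mu in sorted(leaf_means.items(), key=lambda kv: kv[1]):
        if groups and mu - groups[-1][-1][1] < tol:
            groups[-1].append((leaf, mu))  # fuse with the current group
        else:
            groups.append([(leaf, mu)])    # start a new fused leaf
    return [[leaf for leaf, _ in g] for g in groups]
```

For example, `amalgamate({'A': 1.0, 'B': 1.3, 'C': 3.0})` fuses A and B into one terminal group and leaves C alone; TreeFuL instead chooses the merge structure by cross-validated fused regularization.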
Pub Date: 2024-12-30. Epub Date: 2024-11-28. DOI: 10.1002/sim.10266
Chanmin Kim, Yisheng Li, Ting Xu, Zhongxing Liao
One goal of precision medicine is to develop effective treatments for patients by tailoring treatment to their individual demographic, clinical, and/or genetic characteristics. To achieve this goal, statistical models must be developed that can identify and evaluate potentially heterogeneous treatment effects in a robust manner. The oft-cited existing methods for assessing treatment effect heterogeneity are based upon parametric models with interactions or conditioning on covariate values, whose performance is sensitive to the omission of important covariates and/or the choice of their values. We propose a new Bayesian nonparametric (BNP) method for estimating heterogeneous causal effects in studies with zero-inflated outcome data, which arise commonly in health-related studies. We employ the enriched Dirichlet process (EDP) mixture in our BNP approach, establishing a connection between an outcome DP mixture and a covariate DP mixture. This enables us to estimate posterior distributions concurrently, facilitating flexible inference regarding individual causal effects. We show in a set of simulation studies that the proposed method outperforms two other BNP methods in terms of bias and mean squared error (MSE) of the conditional average treatment effect estimates. In particular, the proposed model has the advantage of appropriately reflecting uncertainty in regions where the overlap condition is violated, compared with other competing models. We apply the proposed method to a study of the relationship between heart radiation dose parameters and the blood level of high-sensitivity cardiac troponin T (hs-cTnT) to examine whether the effect of a high mean heart radiation dose on hs-cTnT varies by baseline characteristics.
Bayesian Nonparametric Model for Heterogeneous Treatment Effects With Zero-Inflated Data.
Statistics in Medicine, pp. 5968-5982. DOI: 10.1002/sim.10266
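Zero-inflated outcome data of the kind targeted here mix structural zeros with a count process. A minimal stdlib-only generator (the mixture weights and Poisson rate are arbitrary choices for illustration) looks like:

```python
import math
import random

def zero_inflated_sample(n, p_zero=0.6, lam=3.0, seed=7):
    """Draw n outcomes: a structural zero with probability p_zero,
    otherwise a Poisson(lam) count (Knuth's multiplication method)."""
    rng = random.Random(seed)

    def poisson(l):
        limit, k, p = math.exp(-l), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    return [0 if rng.random() < p_zero else poisson(lam) for _ in range(n)]
```

A BNP model such as the EDP mixture would then model the zero mass and the positive part jointly; that machinery is well beyond a sketch.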
Pub Date: 2024-12-30. Epub Date: 2024-11-25. DOI: 10.1002/sim.10292
Guoqiao Wang, Jason Hassenstab, Yan Li, Andrew J Aschenbrenner, Eric M McDade, Jorge Llibre-Guerra, Randall J Bateman, Chengjie Xiong
Measurement burst designs typically administer brief cognitive tests four times per day for one week every 6 months, resulting in a maximum of 28 data points per test per burst. In Alzheimer's disease clinical trials, measurement burst designs hold great promise for boosting statistical power by collecting huge amounts of data. However, appropriate methods for analyzing these complex datasets have not been well investigated. Furthermore, the large amount of burst design data also poses tremendous challenges for traditional computational procedures such as SAS PROC MIXED or PROC NLMIXED. We propose to analyze burst design data using novel hierarchical linear mixed effects models or hierarchical mixed models for repeated measures. Through simulations and real-world data applications using the novel SAS procedure HPMIXED, we demonstrate these hierarchical models' efficiency over traditional models. Our sample simulation and analysis code can serve as a catalyst to facilitate methodology development for burst design data.
Unlocking Cognitive Analysis Potential in Alzheimer's Disease Clinical Trials: Investigating Hierarchical Linear Models for Analyzing Novel Measurement Burst Design Data.
Statistics in Medicine, pp. 5898-5910. DOI: 10.1002/sim.10292
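The arithmetic in the abstract, four brief sessions per day over a seven-day burst, can be spelled out by enumerating one burst in long format (the field names are invented for illustration):

```python
from itertools import product

# One measurement burst: 7 days x 4 sessions/day = at most 28 observations
# per test; in a trial, such a burst would recur every 6 months.
burst = [{"day": d, "session": s} for d, s in product(range(1, 8), range(1, 5))]
```

A hierarchical model would then nest sessions within days within bursts within participants, which is what makes these datasets both rich and computationally demanding.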
Pub Date: 2024-12-30. Epub Date: 2024-11-11. DOI: 10.1002/sim.10259
Sierra Pugh, Andrew T Levin, Gideon Meyerowitz-Katz, Satej Soman, Nana Owusu-Boaitey, Anthony B Zwi, Anup Malani, Ander Wilson, Bailey K Fosdick
The COVID-19 infection fatality rate (IFR) is the proportion of individuals infected with SARS-CoV-2 who subsequently die. As COVID-19 disproportionately affects older individuals, age-specific IFR estimates are imperative to facilitate comparisons of the impact of COVID-19 between locations and to prioritize distribution of scarce resources. However, a coherent method is lacking for synthesizing available data into estimates of IFR and seroprevalence that vary continuously with age and adequately reflect the uncertainties inherent in the underlying data. In this article, we introduce a novel Bayesian hierarchical model to estimate IFR as a continuous function of age that acknowledges heterogeneity in population age structure across locations and accounts for uncertainty in the estimates due to seroprevalence sampling variability and imperfect serology test assays. Our approach simultaneously models test assay characteristics, serology, and death data, where the serology and death data are often available only for binned age groups. Information is shared across locations through hierarchical modeling to improve estimation of the parameters with limited data.
A Hierarchical Bayesian Model for Estimating Age-Specific COVID-19 Infection Fatality Rates in Developing Countries.
Statistics in Medicine, pp. 5667-5680. DOI: 10.1002/sim.10259
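As a reader's illustration of an IFR that "varies continuously with age," here is a log-linear curve of the general shape such analyses often report; the intercept and slope below are invented, not estimates from this paper:

```python
import math

def ifr(age, intercept=-12.0, slope=0.12):
    """Illustrative infection fatality rate, log-linear in age
    (made-up coefficients; monotone increasing, bounded in (0, 1)
    over the plotted age range)."""
    return math.exp(intercept + slope * age)
```

A hierarchical version would let `intercept` and `slope` vary by location around shared hyperpriors, which is roughly the borrowing-of-information structure the abstract describes.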
Pub Date: 2024-12-30. Epub Date: 2024-12-01. DOI: 10.1002/sim.10290
Jie Ding, Jialiang Li, Ping Xie, Xiaoguang Wang
Using informative sources to enhance statistical analysis in target studies has become an increasingly popular research topic. However, cohorts with time-to-event outcomes have not received sufficient attention, and external studies often encounter issues of incomparability due to population heterogeneity and unmeasured risk factors. To improve individualized risk assessments, we propose a novel methodology that adaptively borrows information from multiple incomparable sources. By extracting aggregate statistics through transitional models applied to both the external sources and the target population, we incorporate this information efficiently using the control variate technique. This approach eliminates the need to load individual-level records from the sources directly, resulting in low computational complexity and strong privacy protection. Asymptotically, our estimators of both relative and baseline risks are more efficient than traditional results, and the power of tests for covariate effects is substantially enhanced. We demonstrate the practical performance of our method via extensive simulations and a real case study.
Efficient Risk Assessment of Time-to-Event Targets With Adaptive Information Transfer.
Statistics in Medicine, pp. 6026-6041. DOI: 10.1002/sim.10290
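The control variate technique the abstract leans on is classical: adjust an estimator using a correlated statistic whose mean is known. A stdlib-only sketch of the basic idea (not the authors' survival-specific construction):

```python
import random

def control_variate_mean(y, x, mu_x):
    """Estimate E[y] as ybar - c * (xbar - mu_x), with c = Cov(y, x) / Var(x)
    estimated from the sample; the variance of the adjusted estimator shrinks
    when y and x are strongly correlated."""
    n = len(y)
    ybar = sum(y) / n
    xbar = sum(x) / n
    cov = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x)) / (n - 1)
    var = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return ybar - (cov / var) * (xbar - mu_x)
```

In the paper's setting, the aggregate statistics extracted from external sources play the role of the control variates, so no individual-level external records need to be loaded.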
Pub Date: 2024-12-30. Epub Date: 2024-11-12. DOI: 10.1002/sim.10263
Ye Tian, Henry Rusinek, Arjun V Masurkar, Yang Feng
High-dimensional multinomial regression models are very useful in practice but have received less research attention than logistic regression models, especially from the perspective of statistical inference. In this work, we analyze the estimation and prediction error of the contrast-based ℓ1-penalized multinomial regression model and extend the debiasing method to the multinomial case, providing a valid confidence interval for each coefficient and a p value for each individual hypothesis test. We also examine cases of model misspecification and non-identically distributed data to demonstrate the robustness of our method when some assumptions are violated. We apply the debiasing method to identify important predictors in the progression into dementia of different subtypes. Results from extensive simulations show the superiority of the debiasing method compared to other inference methods.
ℓ1-Penalized Multinomial Regression: Estimation, Inference, and Prediction, With an Application to Risk Factor Identification for Different Dementia Subtypes.
Statistics in Medicine, pp. 5711-5747. DOI: 10.1002/sim.10263
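The ℓ1 penalty underlying the model corresponds to the soft-thresholding proximal operator, the basic building block of lasso-type estimation. This is a generic fact about ℓ1 penalties, not the authors' contrast-based multinomial procedure:

```python
def soft_threshold(z, lam):
    """Proximal operator of lam * |.|: shrink z toward zero by lam,
    clipping to exactly 0 inside [-lam, lam] (this clipping is what
    produces sparse coefficient estimates)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

Debiasing then corrects the shrinkage bias that this operator introduces, which is what makes valid confidence intervals and p values possible.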
Pub Date: 2024-12-30. Epub Date: 2024-11-19. DOI: 10.1002/sim.10280
Ying Sheng, Yifei Sun
Proportional rate models are among the most popular methods for analyzing recurrent event data. Although providing a straightforward rate-ratio interpretation of covariate effects, the proportional rate assumption implies that covariates do not modify the shape of the rate function. When the proportionality assumption fails to hold, we propose to characterize covariate effects on the rate function through two types of parameters: the shape parameters and the size parameters. The former allow the covariates to flexibly affect the shape of the rate function, and the latter retain the interpretability of covariate effects on the magnitude of the rate function. To overcome the challenges in simultaneously estimating the two sets of parameters, we propose a conditional pseudolikelihood approach to eliminate the size parameters in shape estimation, followed by an event count projection approach for size estimation. The proposed estimators are asymptotically normal with a root-n convergence rate. Simulation studies and an analysis of recurrent hospitalizations using SEER-Medicare data are conducted to illustrate the proposed methods.
Statistical Inference for Counting Processes Under Shape Heterogeneity.
Statistics in Medicine, pp. 5849-5861. DOI: 10.1002/sim.10280
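For context, the proportional rate model that the shape/size parameterization relaxes has the standard counting-process form (usual notation, not copied from the paper):

```latex
E\{\, dN_i(t) \mid X_i \,\} \;=\; \exp\!\left(\beta^{\top} X_i\right) d\mu_0(t),
```

where $N_i(t)$ counts recurrent events by time $t$ and $\mu_0$ is a baseline mean function. Under this model, covariates rescale the rate but cannot change its shape, which is precisely the restriction the shape parameters above remove.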
Pub Date: 2024-12-30, Epub Date: 2024-11-20, DOI: 10.1002/sim.10283
Dadong Zhang, Jingye Wang, Suqin Cai, Johan Surtihadi
The positive predictive value (PPV) and negative predictive value (NPV) can be expressed as functions of disease prevalence (ρ) and the ratios of two binomial proportions (φ), where φ_ppv = (1 − specificity)/sensitivity and φ_npv = (1 − sensitivity)/specificity. In prospective studies, where the proportion of subjects with the disease in the study cohort is an unbiased estimate of the disease prevalence, the confidence intervals (CIs) of PPV and NPV can be estimated using established methods for a single proportion. However, in enrichment studies, such as case-control studies, where the proportion of diseased subjects significantly differs from disease prevalence, estimating CIs for PPV and NPV remains a challenge in terms of skewness and overall coverage, especially under extreme conditions (e.g., NPV = 1). In this article, we extend the method adopted by Li, in which CIs for PPV and NPV were derived from those of φ. We explored additional CI methods for φ, including those by Gart & Nam (GN), MoverJ, and Walter, and converted them into corresponding CIs for PPV and NPV. Through simulations, we compared these methods with established CI methods, Fieller, Pepe, and Delta, in terms of skewness and overall coverage. While no method proves universally optimal, the GN and MoverJ methods generally emerge as recommended choices.
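The identities in the abstract can be checked numerically. The sketch below is a minimal illustration of the standard Bayes-theorem expressions for PPV and NPV rewritten through the ratios φ_ppv and φ_npv; it is not the authors' code, and the function name `ppv_npv` is an assumption. Because PPV and NPV are monotone in the corresponding φ, a CI for φ maps endpoint-wise into a CI for the predictive value, which is the conversion the article builds on.

```python
def ppv_npv(prevalence, sensitivity, specificity):
    """PPV and NPV written through the ratios of two binomial proportions."""
    # Ratios defined in the abstract:
    phi_ppv = (1 - specificity) / sensitivity
    phi_npv = (1 - sensitivity) / specificity
    odds = (1 - prevalence) / prevalence  # odds of being disease-free
    # PPV = 1 / (1 + odds * phi_ppv); NPV = 1 / (1 + phi_npv / odds).
    # Both are monotone in phi, so CI endpoints for phi convert directly.
    ppv = 1.0 / (1.0 + odds * phi_ppv)
    npv = 1.0 / (1.0 + phi_npv / odds)
    return ppv, npv

# Example: prevalence 10%, sensitivity 0.90, specificity 0.80
ppv, npv = ppv_npv(0.10, 0.90, 0.80)
```

These expressions agree with the direct Bayes-theorem forms PPV = ρ·sens / (ρ·sens + (1 − ρ)(1 − spec)) and NPV = (1 − ρ)·spec / ((1 − ρ)·spec + ρ(1 − sens)).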
{"title":"Skewness-Corrected Confidence Intervals for Predictive Values in Enrichment Studies.","authors":"Dadong Zhang, Jingye Wang, Suqin Cai, Johan Surtihadi","doi":"10.1002/sim.10283","DOIUrl":"10.1002/sim.10283","url":null,"abstract":"<p><p>The positive predictive value (PPV) and negative predictive value (NPV) can be expressed as functions of disease prevalence ( <math> <semantics><mrow><mi>ρ</mi></mrow> <annotation>$$ \rho $$</annotation></semantics> </math> ) and the ratios of two binomial proportions ( <math> <semantics><mrow><mi>ϕ</mi></mrow> <annotation>$$ \phi $$</annotation></semantics> </math> ), where <math> <semantics> <mrow><msub><mi>ϕ</mi> <mi>ppv</mi></msub> <mo>=</mo> <mfrac><mrow><mn>1</mn> <mo>-</mo> <mtext>specificity</mtext></mrow> <mtext>sensitivity</mtext></mfrac> </mrow> <annotation>$$ {\phi}_{ppv}=\frac{1-\mathrm{specificity}}{\mathrm{sensitivity}} $$</annotation></semantics> </math> and <math> <semantics> <mrow><msub><mi>ϕ</mi> <mi>npv</mi></msub> <mo>=</mo> <mfrac><mrow><mn>1</mn> <mo>-</mo> <mtext>sensitivity</mtext></mrow> <mtext>specificity</mtext></mfrac> </mrow> <annotation>$$ {\phi}_{npv}=\frac{1-\mathrm{sensitivity}}{\mathrm{specificity}} $$</annotation></semantics> </math> . In prospective studies, where the proportion of subjects with the disease in the study cohort is an unbiased estimate of the disease prevalence, the confidence intervals (CIs) of PPV and NPV can be estimated using established methods for a single proportion. However, in enrichment studies, such as case-control studies, where the proportion of diseased subjects significantly differs from disease prevalence, estimating CIs for PPV and NPV remains a challenge in terms of skewness and overall coverage, especially under extreme conditions (e.g., <math> <semantics><mrow><mi>NPV</mi> <mo>=</mo> <mn>1</mn></mrow> <annotation>$$ \mathrm{NPV}=1 $$</annotation></semantics> </math> ). 
In this article, we extend the method adopted by Li, in which CIs for PPV and NPV were derived from those of <math> <semantics><mrow><mi>ϕ</mi></mrow> <annotation>$$ \phi $$</annotation></semantics> </math> . We explored additional CI methods for <math> <semantics><mrow><mi>ϕ</mi></mrow> <annotation>$$ \phi $$</annotation></semantics> </math> , including those by Gart & Nam (GN), MoverJ, and Walter, and converted them into corresponding CIs for PPV and NPV. Through simulations, we compared these methods with established CI methods, Fieller, Pepe, and Delta, in terms of skewness and overall coverage. While no method proves universally optimal, the GN and MoverJ methods generally emerge as recommended choices.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":" ","pages":"5862-5871"},"PeriodicalIF":1.8,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142682841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Medicine","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}