Shengxian Ding, Debajyoti Sinha, Greg Hajcak, Roman Kotov, Chao Huang
Existing research in mental health has established that rising depressive symptoms in adolescents are associated with parental history of depression and other behavioral risk factors. Our goal is to investigate how these scalar variables, together with multiple functional covariates capturing neural responses to rewards, relate to future adolescent depression. Departing from prior studies that typically relied on simple linear regression to model all covariates, we propose a novel Bayesian quantile regression framework. This approach constructs a single-index summary of both scalar and functional covariates, coupled with a monotone link function that flexibly captures unknown nonlinear relationships and interactions. Our method addresses several limitations of existing approaches. It offers a clinically interpretable index akin to that of linear models, ensures that the estimated quantile remains within the response bounds, and jointly incorporates the registration of functional covariates within the quantile regression analysis. Our simulation studies demonstrate that our method outperforms existing unrestricted single-index-based methods, particularly when there are both scalar and preregistered functional covariates. Furthermore, we showcase the practical utility of our framework using data from a large-scale adolescent depression study, yielding a new, statistically principled summary of neural reward processing with direct relevance to future depression risk.
{"title":"Bayesian monotone single-index quantile regression model with bounded response and misaligned functional covariates.","authors":"Shengxian Ding, Debajyoti Sinha, Greg Hajcak, Roman Kotov, Chao Huang","doi":"10.1093/biomtc/ujaf145","DOIUrl":"10.1093/biomtc/ujaf145","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4"},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569519/pdf/","platform":"Semanticscholar","paperid":"145385616","RegionNum":4,"RegionCategory":"Mathematics","PMCID":"OA","Total":0}
Keith Barnatchez, Rachel Nethery, Bryan E Shepherd, Giovanni Parmigiani, Kevin P Josey
Exposure measurement error is a ubiquitous but often overlooked challenge in causal inference with observational data. Existing methods accounting for exposure measurement error largely rely on restrictive parametric assumptions, while emerging data-adaptive estimation approaches allow for less restrictive assumptions but at the cost of flexibility, as they are typically tailored toward rigidly defined statistical quantities. There remains a critical need for assumption-lean estimation methods that are both flexible and possess desirable theoretical properties across a variety of study designs. In this paper, we introduce a general framework for estimation of causal quantities in the presence of exposure measurement error, adapted from the method of control variates. Our method can be implemented in various two-phase sampling study designs, where one obtains gold-standard exposure measurements for a small subset of the full study sample, called the validation data. The control variates framework leverages both the error-prone and error-free exposure measurements by augmenting an initial consistent estimator from the validation data with a variance reduction term formed from the full data. We show that our method inherits double-robustness properties under standard causal assumptions. Simulation studies show that our approach performs favorably compared to leading methods under various two-phase sampling schemes. We illustrate our method with observational electronic health record data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic.
{"title":"Flexible and efficient estimation of causal effects with error-prone exposures: a control variates approach for measurement error.","authors":"Keith Barnatchez, Rachel Nethery, Bryan E Shepherd, Giovanni Parmigiani, Kevin P Josey","doi":"10.1093/biomtc/ujaf151","DOIUrl":"10.1093/biomtc/ujaf151","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4"},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12882774/pdf/","platform":"Semanticscholar","paperid":"145653462","RegionNum":4,"RegionCategory":"Mathematics","PMCID":"OA","Total":0}
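The control variates idea summarized in the abstract above can be illustrated in its simplest textbook form. The sketch below is not the authors' estimator. It assumes a simple random validation subsample, targets a plain mean rather than a causal quantity, and uses hypothetical names (`control_variate_mean`, `t_val`, `w_val`, `w_full`). It shows only the general mechanism: an initial estimate from the validation data is adjusted by the gap between the validation-sample and full-sample means of an error-prone quantity that is observed for everyone.

```python
import numpy as np


def control_variate_mean(t_val, w_val, w_full):
    """Estimate E[T] from validation data, using W as a control variate.

    t_val  : gold-standard values, available in the validation subset only
    w_val  : error-prone values for the same validation subjects
    w_full : error-prone values for the full study sample
    All names are illustrative assumptions, not taken from the paper.
    """
    t_val = np.asarray(t_val, dtype=float)
    w_val = np.asarray(w_val, dtype=float)
    w_full = np.asarray(w_full, dtype=float)

    # Coefficient that minimizes the variance of the adjusted estimator
    # under simple random validation sampling: Cov(T, W) / Var(W).
    cov = np.cov(t_val, w_val, ddof=1)
    gamma = cov[0, 1] / cov[1, 1]

    # Initial validation-only estimate minus the variance reduction term.
    return t_val.mean() - gamma * (w_val.mean() - w_full.mean())


# Toy data: T and W coincide in the validation set, so gamma is 1 and the
# adjusted estimate moves to the full-sample mean of W.
estimate = control_variate_mean([1, 2, 3], [1, 2, 3], [1, 2, 3, 4, 5, 6])
print(estimate)
```

The adjustment term has mean zero when the validation subset is a random sample of the full data, so it leaves the target unchanged while reducing variance whenever the two measurements are correlated.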
Longitudinal data are often subject to irregular and informative visit times. Weighting generalized estimating equations by the inverse of the visit rate yields asymptotically unbiased estimates of regression coefficients provided that outcomes and visit times are conditionally independent, given the covariates in the visit model. Adding other covariates has no impact on the asymptotic bias of estimated regression coefficients, provided that conditional independence is maintained, but the impact on their variances is unknown. We show that variances are unchanged on adding variables associated with neither outcome nor visit process, and decrease on adding variables associated with outcome but not visit process. Adding variables associated with visits but not outcome may either increase or decrease variances of estimated outcome model regression coefficients, depending on the correlation structure of the covariates and the outcome. Application to a study of major depressive disorder found that the variances of estimated regression coefficients were of a similar magnitude when predictors of outcome but not visits were added to the visit rate model but consistently larger, in some cases by a factor of 2, on adding predictors of visits but not outcome. We recommend that visit process models include variables associated with outcome, but that those unassociated with the outcome be treated with caution.
{"title":"Inverse-intensity weighted generalized estimating equations for longitudinal data subject to irregular observation: which variables should be included in the visit rate model?","authors":"Eleanor M Pullenayegum, Di Shan","doi":"10.1093/biomtc/ujaf128","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf128","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4"},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145249508","RegionNum":4,"RegionCategory":"Mathematics","Total":0}
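As a rough illustration of the weighting step described in the abstract above, the sketch below solves an inverse-intensity weighted estimating equation for a linear mean model with an independence working correlation. It assumes the visit rates have already been estimated from some visit model, which is not shown. The function name `iiw_linear_fit` and its arguments are hypothetical, and the sketch does not address variance estimation, which is the focus of the paper.

```python
import numpy as np


def iiw_linear_fit(x, y, visit_rate):
    """Solve sum_i w_i * x_i * (y_i - x_i' beta) = 0 with w_i = 1 / visit_rate_i.

    x          : (n_obs, p) design matrix, one row per observed visit
    y          : (n_obs,) outcomes at those visits
    visit_rate : (n_obs,) estimated visit rate at each visit (assumed given)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(visit_rate, dtype=float)

    # Weighted normal equations: (X' W X) beta = X' W y.
    xtwx = x.T @ (w[:, None] * x)
    xtwy = x.T @ (w * y)
    return np.linalg.solve(xtwx, xtwy)


# Toy data: visits with a higher estimated rate receive less weight.
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
beta = iiw_linear_fit(x, y, visit_rate=[0.5, 1.0, 2.0, 4.0])
print(beta)
```

Which covariates enter the model that produces `visit_rate` is exactly the question the abstract addresses; the sketch treats that choice as already made.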
Min Zeng, Qiyu Wang, Zijian Sui, Hong Zhang, Jinfeng Xu
Causal inference using observational data often suffers from numerous confounding effects, with greatly distorted average causal effect (ACE) estimates if the confounders are ignored. Information on some confounders, such as genetic biomarkers and medical imaging, is prohibitively expensive to obtain in practice. Two-phase studies are resource-efficient solutions to this problem. In such studies, outcome, treatment, and inexpensive confounders are measured for a large number of subjects in the first phase; costly confounder measurements are then collected for a limited number of subjects in the second phase. An efficient statistical design is essential in controlling the cost arising in the second phase. In this paper, we propose an adaptive stratified sampling design (AdaStrat), which minimizes the variance of the ACE estimator for a given second-phase sample size. AdaStrat begins by gathering costly confounder measurements for randomly selected pilot data, which are used to develop a stratification strategy and determine the sampling probabilities of the strata. The resulting stratification and sampling strategy is applied to all first-phase subjects to determine the second-phase subjects for whom costly confounder measurements are collected. We rigorously show that AdaStrat produces a more efficient ACE estimator than existing sampling designs with prespecified strata. Finite sample properties of AdaStrat were evaluated through simulation studies, demonstrating its superiority over the fixed stratified sampling design (FixStrat), with relative efficiencies ranging from 20% to 30% in our simulation settings. The desired finite sample properties of AdaStrat were further confirmed through an application to UK Biobank data.
{"title":"Adaptive stratified sampling design in two-phase studies for average causal effect estimation.","authors":"Min Zeng, Qiyu Wang, Zijian Sui, Hong Zhang, Jinfeng Xu","doi":"10.1093/biomtc/ujaf143","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf143","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4"},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145372094","RegionNum":4,"RegionCategory":"Mathematics","Total":0}
Oliver J Hines, Karla Diaz-Ordaz, Stijn Vansteelandt
Motivated by applications in precision medicine and treatment effect heterogeneity, recent research has focused on estimating conditional average treatment effects (CATEs) using machine learning (ML). CATE estimates may represent complicated functions that provide little insight into the key drivers of heterogeneity. Therefore, we introduce nonparametric treatment effect variable importance measures (TE-VIMs), based on the mean-squared error (MSE) in predicting the individual treatment effect. More precisely, TE-VIMs represent the increase in MSE when variables are removed from the CATE conditioning set. We derive efficient TE-VIM estimators which can be used with any CATE estimation strategy and are amenable to ML estimation. We propose several strategies to calculate these VIMs (eg, leave-one-out or keep-one-in), using popular meta-learners for the CATE. We study the finite sample performance of the estimators through a simulation study and illustrate their application using clinical trial data.
{"title":"Variable importance measures for heterogeneous treatment effects.","authors":"Oliver J Hines, Karla Diaz-Ordaz, Stijn Vansteelandt","doi":"10.1093/biomtc/ujaf140","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf140","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4"},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145817594","RegionNum":4,"RegionCategory":"Mathematics","Total":0}
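The verbal definition of a TE-VIM in the abstract above ("the increase in MSE when variables are removed from the CATE conditioning set") can be written out as a short worked equation. The notation here is an assumption made for illustration and may differ from the paper's: $Y^1 - Y^0$ is the individual treatment effect, $X$ the full covariate vector, $X_{-s}$ the covariates with the subset $s$ removed, and $\Theta_s$ a label for the unscaled importance of $s$.

```latex
\begin{align*}
\tau(x) &= E\left[ Y^1 - Y^0 \mid X = x \right], \qquad
\tau_{-s}(x) = E\left[ \tau(X) \mid X_{-s} = x_{-s} \right], \\
\Theta_s &= E\left[ \{ (Y^1 - Y^0) - \tau_{-s}(X) \}^2 \right]
          - E\left[ \{ (Y^1 - Y^0) - \tau(X) \}^2 \right] \\
         &= E\left[ \{ \tau(X) - \tau_{-s}(X) \}^2 \right].
\end{align*}
```

The last equality follows by writing $(Y^1 - Y^0) - \tau_{-s}(X)$ as $\{(Y^1 - Y^0) - \tau(X)\} + \{\tau(X) - \tau_{-s}(X)\}$ and noting that the cross term vanishes because $E[(Y^1 - Y^0) - \tau(X) \mid X] = 0$. Under this reading the measure is nonnegative and equals zero exactly when removing the variables in $s$ leaves the CATE unchanged.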
Identifying protein-protein interaction networks can reveal therapeutic targets in cancer; however, for heterogeneous cancers such as colorectal cancer (CRC), a pooled analysis of the entire dataset may miss subtype-specific mechanisms, whereas separate analyses of each subgroup's data may reduce the power to identify shared relations. To address this limitation, we propose a hierarchical Bayesian model for the inference of dependency networks that encourages the common selection of edges across subgroups while allowing subtype-specific connections. To allow for nonlinear dependence relations, we rely on Bayesian Additive Regression Trees (BART) to characterize the key mechanisms for each subgroup. Because BART is a flexible model that allows nonlinear effects and interactions, it is more suitable for genomic data than classical models that assume linearity. To connect the subgroups, we place a Markov random field prior on the probability of utilizing a feature in a splitting rule; this allows us to borrow strength across subgroups in identifying shared dependence relations. We illustrate the model using both simulated data and a real data application on the estimation of protein-protein interaction networks across CRC subtypes.
{"title":"Joint Bayesian additive regression trees for multiple nonlinear dependency networks.","authors":"Licai Huang, Christine B Peterson, Min Jin Ha","doi":"10.1093/biomtc/ujaf158","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf158","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4"},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145740907","RegionNum":4,"RegionCategory":"Mathematics","Total":0}
Xiaoyu Qiu, Yuhan Qian, Jaehwan Yi, Jinqiu Wang, Yu Du, Yanyao Yi, Ting Ye
The Mantel-Haenszel (MH) risk difference estimator, commonly used in randomized clinical trials for binary outcomes, calculates a weighted average of stratum-specific risk difference estimators. Traditionally, this method requires the stringent assumption that risk differences are homogeneous across strata, also known as the common (constant) risk difference assumption. In our paper, we relax this assumption and adopt a modern perspective, viewing the MH risk difference estimator as an approach for covariate adjustment in randomized clinical trials, distinguishing its use from that in meta-analysis and observational studies. We demonstrate that, under reasonable restrictions on risk difference variability, the MH risk difference estimator consistently estimates the average treatment effect within a standard super-population framework, which is often the primary interest in randomized clinical trials, in addition to estimating a weighted average of stratum-specific risk differences. We rigorously study its properties under the large-stratum and sparse-stratum asymptotic regimes, as well as under mixed-regime settings. Furthermore, for either estimand, we propose a unified robust variance estimator that improves over the popular variance estimators by Greenland and Robins and Sato et al. and has provable consistency across these asymptotic regimes, regardless of assuming common risk differences. Extensions of our theoretical results also provide new insights into the MH test, the post-stratification estimator, and settings with multiple treatments. Our findings are thoroughly validated through simulations and a clinical trial example.
{"title":"Clarifying the role of the Mantel-Haenszel risk difference estimator in randomized clinical trials.","authors":"Xiaoyu Qiu, Yuhan Qian, Jaehwan Yi, Jinqiu Wang, Yu Du, Yanyao Yi, Ting Ye","doi":"10.1093/biomtc/ujaf142","DOIUrl":"10.1093/biomtc/ujaf142","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4"},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12576803/pdf/","platform":"Semanticscholar","paperid":"145420927","RegionNum":4,"RegionCategory":"Mathematics","PMCID":"OA","Total":0}
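For readers unfamiliar with the Mantel-Haenszel risk difference estimator discussed in the abstract above, the sketch below computes it in its standard form: a weighted average of stratum-specific risk differences with weights n1 * n0 / (n1 + n0). The function name `mh_risk_difference` and the tuple layout are assumptions made for illustration. The robust variance estimator proposed in the paper is not reproduced here, since the abstract does not give its form.

```python
def mh_risk_difference(strata):
    """Mantel-Haenszel risk difference across strata.

    strata: iterable of (x1, n1, x0, n0) tuples, where x1 of n1 treated
    subjects and x0 of n0 control subjects had the event in that stratum.
    """
    numerator = 0.0
    denominator = 0.0
    for x1, n1, x0, n0 in strata:
        if n1 == 0 or n0 == 0:
            # A stratum with an empty arm has weight zero and contributes nothing.
            continue
        weight = n1 * n0 / (n1 + n0)
        numerator += weight * (x1 / n1 - x0 / n0)
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no stratum contains subjects in both arms")
    return numerator / denominator


# Two strata with risk differences 0.25 and 0.20 and weights 10 and 25.
print(mh_risk_difference([(10, 20, 5, 20), (30, 50, 20, 50)]))
```

Because the weights depend only on the arm sizes, the result is a weighted average of the stratum-specific risk differences, which is the quantity the abstract refers to when it relaxes the common risk difference assumption.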
Micro-randomized trials (MRTs) play a crucial role in optimizing digital interventions. In an MRT, each participant is sequentially randomized among treatment options hundreds of times. While the interventions tested in MRTs target short-term behavioral responses (proximal outcomes), their ultimate goal is to drive long-term behavior change (distal outcomes). However, existing causal inference methods, such as the causal excursion effect, are limited to proximal outcomes, making it challenging to quantify the long-term impact of interventions. To address this gap, we introduce the distal causal excursion effect (DCEE), a novel estimand that quantifies the long-term effect of time-varying treatments. The DCEE contrasts distal outcomes under two excursion policies while marginalizing over most treatment assignments, enabling a parsimonious and interpretable causal model even with a large number of decision points. We propose two estimators for the DCEE, one with cross-fitting and one without, both robust to misspecification of the outcome model. We establish their asymptotic properties and validate their performance through simulations. We apply our method to the HeartSteps MRT to assess the impact of activity prompts on long-term habit formation. Our findings suggest that prompts delivered earlier in the study have a stronger long-term effect than those delivered later, underscoring the importance of intervention timing in behavior change. This work provides a critically needed toolkit for scientists working on digital interventions to assess long-term causal effects using MRT data.
{"title":"Distal causal excursion effects: modeling long-term effects of time-varying treatments in micro-randomized trials.","authors":"Tianchen Qian","doi":"10.1093/biomtc/ujaf134","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf134","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4"},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145298424","RegionNum":4,"RegionCategory":"Mathematics","Total":0}
Generalized factor models have been widely employed for dimension reduction across various types of multivariate data, including binary choices, counts, and continuous observations. While determining the number of factors in such models has received significant scholarly attention, it remains an open challenge in the field. In this paper, we propose a cross-validation (CV) method based on entrywise splitting (ES), rather than sample splitting, to address this problem. Similar to traditional cross-validation, this approach primarily prevents the underestimation of the number of factors. We then introduce a penalized entrywise splitting cross-validation criterion, which integrates the original CV with information theoretic criteria by adding a penalty term. Its consistency is established under mild conditions in a high-dimensional setting, where both the sample size and the number of features grow to infinity. Furthermore, we extend our methodology to randomly missing data under different missingness-probability scenarios. We evaluate the performance of the proposed method through comprehensive simulations and apply it to a mouse brain single-cell RNA sequencing dataset.
{"title":"Entrywise splitting cross-validation in generalized factor models: from sample splitting to entrywise splitting.","authors":"Zhijing Wang","doi":"10.1093/biomtc/ujaf153","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf153","url":null,"abstract":"<p><p>The generalized factor models have been widely employed for dimension reduction across various types of multivariate data, including binary choices, counts, and continuous observations. While determining the number of factors in such models has received significant scholarly attention, it remains an open challenge in the field. In this paper, we propose a cross-validation (CV) method based on entrywise splitting (ES), rather than sample splitting, to address this problem. Similar to traditional cross-validation, this approach primarily prevents the underestimation of the number of factors. We then introduce a penalized entrywise splitting cross-validation criterion, which integrates the original CV with information theoretic criteria by adding a penalty term. Its consistency is established under mild conditions in a high-dimensional setting, where both the sample size and the number of features grow to infinity. Furthermore, we extend our methodology to random missing data with different probability scenarios. We evaluate the performance of the proposed method through comprehensive simulations and apply it to a mouse brain single-cell RNA sequencing dataset.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145628656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuta Yamauchi, Genya Kobayashi, Shonosuke Sugasawa
Count data frequently arise in biomedical applications, such as the length of hospital stay. However, their discrete nature poses significant challenges for appropriately modeling conditional quantiles, which are crucial for understanding heterogeneous effects and variability in outcomes. To address this practical difficulty, we propose a novel general Bayesian framework for quantile regression tailored to count data. We seek the regression parameter on the conditional quantile by minimizing the expected loss with respect to the distribution of the conditional quantile of the latent continuous variable associated with the observed count response variable. By modeling the unknown conditional distribution through a Bayesian nonparametric kernel mixture for the joint distribution of the count response and covariates, we obtain the posterior distribution of the regression parameter via a simple optimization. We numerically demonstrate that the proposed method reduces bias and improves estimation accuracy relative to existing crude approaches to count quantile regression. Furthermore, we analyze the length of hospital stay for acute myocardial infarction and demonstrate that the proposed method gives more interpretable and flexible results than existing methods.
{"title":"Flexible Bayesian quantile regression for counts via generative modeling.","authors":"Yuta Yamauchi, Genya Kobayashi, Shonosuke Sugasawa","doi":"10.1093/biomtc/ujaf152","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf152","url":null,"abstract":"<p><p>Count data frequently arises in biomedical applications, such as the length of hospital stay. However, their discrete nature poses significant challenges for appropriately modeling conditional quantiles, which are crucial for understanding heterogeneous effects and variability in outcomes. To solve the practical difficulty, we propose a novel general Bayesian framework for quantile regression tailored to count data. We seek the regression parameter on the conditional quantile by minimizing the expected loss with respect to the distribution of the conditional quantile of the latent continuous variable associated with the observed count response variable. By modeling the unknown conditional distribution through a Bayesian nonparametric kernel mixture for the joint distribution of the count response and covariates, we obtain the posterior distribution of the regression parameter via a simple optimization. We numerically demonstrate that the proposed method improves bias and estimation accuracy of the existing crude approaches to count quantile regression. Furthermore, we analyze the length of hospital stay for acute myocardial infarction and demonstrate that the proposed method gives more interpretable and flexible results than the existing ones.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145628659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}