We present a novel framework for concomitant dimension reduction and clustering, built on a new class of Bayesian clustering factor models. These models assume a factor model structure in which the vectors of common factors follow a mixture of Gaussian distributions. We develop a Gibbs sampler to explore the posterior distribution and propose an information criterion to select the number of clusters and the number of factors. Simulation studies show that our inferential approach appropriately quantifies uncertainty. In addition, when compared with two previously published competitor methods, our information criterion performs favorably in correctly selecting the number of clusters and the number of factors. Finally, we illustrate the capabilities of our framework with an application to data on recovery from opioid use disorder, where clustering of individuals may facilitate personalized health care.
{"title":"Bayesian Clustering Factor Models.","authors":"Hwasoo Shin, Marco A R Ferreira, Allison N Tegge","doi":"10.1002/sim.70350","DOIUrl":"10.1002/sim.70350","url":null,"abstract":"<p><p>We present a novel framework for concomitant dimension reduction and clustering. This framework is based on a novel class of Bayesian clustering factor models. These models assume a factor model structure where the vectors of common factors follow a mixture of Gaussian distributions. We develop a Gibbs sampler to explore the posterior distribution and propose an information criterion to select the number of clusters and the number of factors. Simulation studies show that our inferential approach appropriately quantifies uncertainty. In addition, when compared to two previously published competitor methods, our information criterion has favorable performance in terms of correct selection of number of clusters and number of factors. Finally, we illustrate the capabilities of our framework with an application to data on recovery from opioid use disorder where clustering of individuals may facilitate personalized health care.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70350"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12826354/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146019691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-linear mixed effects models (NLMEMs) defined by ordinary differential equations (ODEs) are central to modeling complex biological systems over time, particularly in pharmacometrics, viral dynamics, and immunology. However, selecting relevant covariates associated with dynamics in high-dimensional settings remains a major challenge. This study introduces a novel model-building approach called Lasso-SAMBA (SAMBA: Stochastic Approximation for Model Building Algorithm) that integrates Lasso regression with a stability selection algorithm for robust covariate selection within ODE-based NLMEMs. The method iteratively constructs models by coupling penalized regression with mechanistic model estimation using the SAEM algorithm. It extends the prior SAMBA strategy by replacing its stepwise inclusion step with a penalized, stability-driven approach that reduces false discoveries and improves selection robustness. By maintaining the monotonic decrease of the information criterion through a calibrated exploration of penalization parameters, the proposed method outperforms conventional stepwise and Bayesian variable selection alternatives. Extensive simulation studies, spanning pharmacokinetic and immunological models, demonstrate the superiority of Lasso-SAMBA in variable selection fidelity, false discovery rate (FDR) control, and computational efficiency. The Lasso-SAMBA method is implemented in an R package. Applied to a Varicella-Zoster virus vaccination study, the method reveals robust, biologically plausible associations between parameters of the mechanistic model of the humoral immune response and early transcriptomic expressions. These results underscore the practical utility of our method for high-dimensional model building in systems vaccinology and beyond.
{"title":"A Stability-Enhanced Lasso Approach for Covariate Selection in Non-Linear Mixed Effects Model.","authors":"Auriane Gabaut, Rodolphe Thiébaut, Cécile Proust-Lima, Mélanie Prague","doi":"10.1002/sim.70407","DOIUrl":"10.1002/sim.70407","url":null,"abstract":"<p><p>Non-linear mixed effects models (NLMEMs) defined by ordinary differential equations (ODEs) are central to modeling complex biological systems over time, particularly in pharmacometrics, viral dynamics, and immunology. However, selecting relevant covariates associated with dynamics in high-dimensional settings remains a major challenge. This study introduces a novel model-building approach called Lasso-SAMBA (SAMBA: Stochastic Approximation for Model Building Algorithm) that integrates Lasso regression with a stability selection algorithm for robust covariate selection within ODE-based NLMEMs. The method iteratively constructs models by coupling penalized regression with mechanistic model estimation using the SAEM algorithm. It extends a prior strategy named SAMBA, originally based on stepwise inclusion, by replacing this step with a penalized, stability-driven approach that reduces false discoveries and improves selection robustness. By maintaining the monotonic decrease of the information criterion through a calibrated exploration of penalization parameters, the proposed method outperforms conventional stepwise and Bayesian variable selection alternatives. Extensive simulation studies, spanning pharmacokinetic and immunological models, demonstrate the superiority of Lasso-SAMBA in variable selection fidelity, FDR (False Discovery Proportion) control, and computational efficiency. The Lasso-SAMBA method is implemented in an R package. Applied to a Varicella-Zoster virus vaccination study, the method reveals robust, biologically plausible associations between parameters of the mechanistic model of the humoral immune response and early transcriptomic expressions. These results underscore the practical utility of our method for high-dimensional model building in systems vaccinology and beyond.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70407"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146019659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing availability of data from different sources, there is a growing interest in leveraging summary information from external studies to improve parameter estimation efficiency for the internal study that collects individual-level data. However, when analyzing right-censored survival data, covariate effects often vary across studies due to differences in study environments, research designs, and patients' inclusion criteria. Such heterogeneity, if not accounted for properly, can lead to biased estimates of covariate effects. In this article, we develop a Privacy-preserving and Heterogeneity-aware Integration (PHI) method to improve efficiency in estimating regression parameters of the internal Cox model under population heterogeneity. The PHI method characterizes parameter heterogeneity by assuming an unknown cluster structure across datasets, and constructs an augmented log partial likelihood function with a fusion penalty to simultaneously estimate the cluster structure and adaptively incorporate summary statistics from external datasets. Estimation consistency and asymptotic normality are established for the proposed estimator. We further prove that the proposed estimator is asymptotically more efficient than the traditional maximum partial likelihood estimator under mild conditions. The PHI method also achieves consistency in estimating the underlying cluster structure across datasets. Simulation studies and brain tumor data analysis are used to investigate the finite-sample performance of the proposed method.
{"title":"Adaptive Incorporation of External Summary Information in the Cox Regression Under Population Heterogeneity.","authors":"Yiqi Li, Yuan Huang, Ying Sheng, Qingzhao Zhang","doi":"10.1002/sim.70371","DOIUrl":"https://doi.org/10.1002/sim.70371","url":null,"abstract":"<p><p>With the increasing availability of data from different sources, there is a growing interest in leveraging summary information from external studies to improve parameter estimation efficiency for the internal study that collects individual-level data. However, when analyzing right-censored survival data, covariate effects often vary across studies due to differences in study environments, research designs, and patients' inclusion criteria. Such heterogeneity, if not accounted for properly, can lead to biased estimates of covariate effects. In this article, we develop a Privacy-preserving and Heterogeneity-aware Integration (PHI) method to improve efficiency in estimating regression parameters of the internal Cox model under population heterogeneity. The PHI method characterizes parameter heterogeneity by assuming an unknown cluster structure across datasets, and constructs an augmented log partial likelihood function with a fusion penalty to simultaneously estimate the cluster structure and adaptively incorporate summary statistics from external datasets. Estimation consistency and asymptotic normality are established for the proposed estimator. We further prove that the proposed estimator is asymptotically more efficient than the traditional maximum partial likelihood estimator under mild conditions. The PHI method also achieves consistency in estimating the underlying cluster structure across datasets. Simulation studies and brain tumor data analysis are used to investigate the finite-sample performance of the proposed method.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70371"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146030905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate prognosis and effective variable selection are essential in high-dimensional survival analysis, particularly for understanding long-term survival outcomes. The mixture cure rate model has been commonly adopted for populations containing subjects with exceptionally long survival times. However, traditional models usually assume log-linear covariate effects, which may not capture the complex, nonlinear relationships in real-world data. Additionally, clinical observations reveal structural similarity between the covariates that influence patient cure rates and those that influence survival times. Existing methods typically estimate the two components of the mixture cure model independently, neglecting their inherent connections. To address these limitations, we enhance the conventional cure rate model by incorporating deep neural networks with a selection layer, while preserving the similarity structure between the cured and susceptible fractions. By integrating regularization constraints on the selection parameters and weight matrices within the neural network, the proposed approach simultaneously achieves effective variable selection and captures complex nonlinear relationships within the data. To further enhance consistency in variable selection across both components of the cure model, a novel penalty is introduced, enabling the model to identify key variables and improve overall performance and interpretability in high-dimensional datasets. Extensive simulation studies and real-world data analysis demonstrate the superior performance and robustness of the proposed approach.
{"title":"Structured Nonlinear Cure Model With Deep Neural Networks for High-Dimensional Survival Analysis.","authors":"Xingdong Feng, Qiaoling Li, Xing Qin, Mengyun Wu, Liang Yu","doi":"10.1002/sim.70368","DOIUrl":"https://doi.org/10.1002/sim.70368","url":null,"abstract":"<p><p>Accurate prognosis and effective variable selection are essential in high-dimensional survival analysis, particularly for understanding long-term survival outcomes. The mixture cure rate model has been commonly adopted for subjects with exceptionally long survival times. However, traditional models usually assume log-linear effects of covariates, which may not capture the complex and nonlinear relationships in real-world data. Additionally, clinical observations reveal structural similarity between covariates that influence both patient cure rates and survival times. Existing methods typically estimate the two components of the mixture cure model independently, neglecting their inherent connections. To address these limitations, in this study, we enhance the conventional cure rate model by incorporating deep neural networks with a selection layer, while preserving the similarity structure between the cured and susceptible fractions. By integrating regularization constraints on the selection parameters and weight matrices within the neural network, the proposed approach simultaneously achieves effective variable selection and handles a series of complex nonlinear relationships within the data. To further enhance consistency in variable selection across both components of the cure model, a novel penalty is introduced, enabling the proposed model to identify key variables and enhance overall performance and interpretability in high-dimensional datasets. Through extensive simulation studies and real-world data analysis, the superior performance and robustness of the proposed approach are evident.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70368"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146030994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a test for the overall effect of interaction between DNA methylation and a set of single nucleotide polymorphisms on a quantitative phenotype. The developed inference procedure is based on a functional approach that extends existing regression models in functional data analysis. Through extensive simulations, we show that the proposed test effectively controls type I error rates and achieves higher empirical power than existing methods, particularly when multiple interactions are present. The use of the proposed test is illustrated with an application to data from obesity patients and controls.
{"title":"A Functional Approach to Testing Overall Effect of Interaction Between DNA Methylation and SNPs.","authors":"Yvelin Gansou, Karim Oualkacha, Marzia Angela Cremona, Lajmi Lakhal-Chaieb","doi":"10.1002/sim.70364","DOIUrl":"10.1002/sim.70364","url":null,"abstract":"<p><p>We introduce a test for the overall effect of interaction between DNA methylation and a set of single nucleotide polymorphisms on a quantitative phenotype. The developed inference procedure is based on a functional approach that extend existing regression models in functional data analysis. Through extensive simulations, we show that the proposed test effectively controls type I error rates and highlights increased empirical power over existing methods, particularly when multiple interactions are present. The use of the proposed test is illustrated with an application to data from obesity patients and controls.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70364"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12828112/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146030928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mendelian randomization (MR) has become an essential tool for causal inference in biomedical and public health research. By using genetic variants as instrumental variables, MR helps address unmeasured confounding and reverse causation, offering a quasi-experimental framework to evaluate causal effects of modifiable exposures on health outcomes. Despite its promise, MR faces substantial methodological challenges, including invalid instruments, weak instrument bias, and design complexities across different data structures. In this tutorial review, we aim to provide a systematic overview of MR methods for causal inference, emphasizing clarity of causal interpretation, study design comparisons, availability of software tools, and practical guidance for applied scientists. We organize the review around causal estimands, ensuring that analyses are anchored to well-defined causal questions. We discuss the problems of invalid and weak instruments, comparing available strategies for their detection and correction. We integrate discussions of population-based versus family-based MR designs, analyses based on individual-level versus summary-level data, and one-sample versus two-sample MR designs, highlighting their relative advantages and limitations. We also summarize recent methodological advances and software developments that extend MR to settings with many weak or invalid instruments and to modern high-dimensional omics data. Real-data applications, including UK Biobank and Alzheimer's disease proteomics studies, illustrate the use of these methods in practice. This review aims to serve as a tutorial-style reference for both methodologists and applied scientists.
{"title":"Mendelian Randomization Methods for Causal Inference: Estimands, Identification and Inference.","authors":"Minhao Yao, Anqi Wang, Xihao Li, Zhonghua Liu","doi":"10.1002/sim.70394","DOIUrl":"10.1002/sim.70394","url":null,"abstract":"<p><p>Mendelian randomization (MR) has become an essential tool for causal inference in biomedical and public health research. By using genetic variants as instrumental variables, MR helps address unmeasured confounding and reverse causation, offering a quasi-experimental framework to evaluate causal effects of modifiable exposures on health outcomes. Despite its promise, MR faces substantial methodological challenges, including invalid instruments, weak instrument bias, and design complexities across different data structures. In this tutorial review, we aim to provide a systematic overview of MR methods for causal inference, emphasizing clarity of causal interpretation, study design comparisons, availability of software tools, and practical guidance for applied scientists. We organize the review around causal estimands, ensuring that analyses are anchored to well-defined causal questions. We discuss the problems of invalid and weak instruments, comparing available strategies for their detection and correction. We integrate discussions of population-based versus family-based MR designs, analyses based on individual-level versus summary-level data, and one-sample versus two-sample MR designs, highlighting their relative advantages and limitations. We also summarize recent methodological advances and software developments that extend MR to settings with many weak or invalid instruments and to modern high-dimensional omics data. Real-data applications, including UK Biobank and Alzheimer's disease proteomics studies, illustrate the use of these methods in practice. This review aims to serve as a tutorial-style reference for both methodologists and applied scientists.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70394"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146030967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, clinical data obtained from patient surveys and medical records have become increasingly pivotal in medical data science. These clinical data, collectively referred to as "real-world data" (RWD), are anticipated to play a key role in observational studies of specific diseases and in advancing personalized or precision medicine by identifying effective treatments for particular patient subgroups. Consequently, the estimation of heterogeneous treatment effects (HTEs) using RWD has garnered substantial attention. HTE estimation contributes meaningfully to precision medicine by enabling clinicians to make treatment decisions tailored to individual patient characteristics. Among treatment effect models for observational studies, bagging causal multivariate adaptive regression splines (BCM) has shown notably robust performance; nevertheless, there remains room for refinement. Here, we introduce a novel treatment effect model, the shrinkage causal bootstrap MARS method, built on the following framework: first, basis functions are estimated using transformed-outcome bootstrap-sampling MARS; the model is then optimized and its parameters estimated via the group least absolute shrinkage and selection operator (LASSO). Our simulations demonstrate that the proposed method achieves improved mean squared error and bias across most scenarios. Additionally, we validate the practical applicability of the method by applying it to the ACTG 175 dataset.
{"title":"Extension of Bootstrap MARS With Group LASSO for Heterogeneous Treatment Effect Estimation.","authors":"Guanwenqing He, Ke Wan, Toshio Shimokawa, Kazushi Maruo","doi":"10.1002/sim.70370","DOIUrl":"https://doi.org/10.1002/sim.70370","url":null,"abstract":"<p><p>In recent years, clinical data obtained from patient surveys and medical records have become increasingly pivotal in medical data science. These clinical data, collectively referred to as \"real-world data (RWD),\" are anticipated to play a key role in observational studies of specific diseases and in advancing personalized or precision medicine by identifying effective treatments for particular patient subgroups. Consequently, the estimation of heterogeneous treatment effects (HTEs) using RWD has garnered substantial attention. HTE estimation meaningfully contributes to precision medicine by enabling clinicians to make informed treatment decisions tailored to individual patient characteristics. Various treatment effect models for observational studies highlight the robust performance of bagging causal multivariate adaptive regression splines (MARS) (BCM). However, despite the notable efficacy of BCM, there remains potential for refinement. Here, we introduce a novel treatment effect model, the shrinkage causal bootstrap MARS method, which builds upon the following framework: initially, basis functions are estimated using transformed outcome bootstrap sampling MARS, followed by optimization of the model and parameter estimation via the group least absolute shrinkage and selection operator (LASSO) method. Our simulations demonstrate that the proposed method achieves improved mean square error and bias across most scenarios. Additionally, we validate the practical applicability of the method by implementing it on the ACTG 175 dataset.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70370"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146030918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In immunotherapy, both the dose and the schedule of drug administration can significantly influence therapeutic effects by modulating immune system activation. Incorporating immune response measures into clinical trial designs offers an opportunity to enhance decision-making by leveraging their close association with therapeutic efficacy and toxicity. Motivated by settings where biomarker data indicate improved efficacy in biomarker-positive patients, we propose a dose-schedule optimization strategy tailored to each biomarker-defined subgroup, based on elicited utility functions that capture risk-benefit tradeoffs. We introduce a joint modeling framework that simultaneously evaluates immune response, toxicity, and efficacy, enabling information sharing across outcome types and patient subgroups. Our approach utilizes parsimonious yet flexible models designed specifically to address challenges due to small sample sizes commonly encountered in early-phase trials. Simulation studies demonstrate that the proposed design achieves desirable operating characteristics and effectively informs dose-schedule optimization.
{"title":"A Biomarker-Based Dose-Schedule Optimization Design for Immunotherapy Trials.","authors":"Yingjie Qiu, Yan Han, Beibei Guo","doi":"10.1002/sim.70357","DOIUrl":"10.1002/sim.70357","url":null,"abstract":"<p><p>In immunotherapy, both the dose and the schedule of drug administration can significantly influence therapeutic effects by modulating immune system activation. Incorporating immune response measures into clinical trial designs offers an opportunity to enhance decision-making by leveraging their close association with therapeutic efficacy and toxicity. Motivated by settings where biomarker data indicate improved efficacy in biomarker-positive patients, we propose a dose-schedule optimization strategy tailored to each biomarker-defined subgroup, based on elicited utility functions that capture risk-benefit tradeoffs. We introduce a joint modeling framework that simultaneously evaluates immune response, toxicity, and efficacy, enabling information sharing across outcome types and patient subgroups. Our approach utilizes parsimonious yet flexible models designed specifically to address challenges due to small sample sizes commonly encountered in early-phase trials. Simulation studies demonstrate that the proposed design achieves desirable operating characteristics and effectively informs dose-schedule optimization.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70357"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12828111/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146030940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces a novel methodology for robust regression analysis when traditional mean regression falls short due to the presence of outliers. Unlike conventional approaches that rely on simple random sampling (SRS), our methodology leverages median nomination sampling (MedNS), using readily available ranking information to obtain training data that more accurately capture the central tendency of the underlying population. This enhances the representativeness of the sample when the population contains extensive outliers. We propose a new loss function that integrates the extra rank information of MedNS data during model fitting, offering a form of robust regression. Further, we provide an alternative approach that translates median regression estimation under MedNS into corresponding problems under SRS. Through simulation studies, including a high-dimensional and a nonlinear regression setting, we evaluate the efficacy of our proposed approach against its SRS counterpart by comparing the integrated mean squared error of the regression estimates, and observe that our method provides higher relative efficiency. Lastly, the proposed methods are applied to a real data set collected for body fat analysis in adults.
{"title":"Leveraging Rank Information for Robust Regression Analysis: A Nomination Sampling Approach.","authors":"Neve Loewen, Mohammad Jafari Jozani","doi":"10.1002/sim.70362","DOIUrl":"10.1002/sim.70362","url":null,"abstract":"<p><p>This paper introduces a novel methodology for robust regression analysis when traditional mean regression falls short due to the presence of outliers. Unlike conventional approaches that rely on simple random sampling (SRS), our methodology leverages median nomination sampling (MedNS) by utilizing readily available ranking information to obtain training data that more accurately captures the central tendency of the underlying population, thereby enhancing the representativeness of the sample in the presence of extensive outliers in the population. We propose a new loss function that integrates the extra rank information of MedNS data during the training phase of model fitting, thus offering a form of robust regression. Further, we provide an alternative approach that translates the median regression estimation using MedNS to corresponding problems under SRS. Through simulation studies, including a high-dimensional and a nonlinear regression setting, we evaluate the efficacy of our proposed approach compared to its SRS counterpart by comparing the integrated mean squared error of regression estimates. We observe that our proposed method provides higher relative efficiency (RE) compared to its SRS counterparts. Lastly, the proposed methods are applied to a real data set collected for body fat analysis in adults.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70362"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12826136/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146019728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background and aim: Mendelian randomization (MR) is a widely used tool to estimate causal effects using genetic variants as instrumental variables. Conventional MR relies on cross-sectional summary statistics, so time-varying effects can only be analyzed through different samples and time points. We aimed to use longitudinal summary statistics for an exposure in a multivariable MR setting and to validate the effect estimates for the exposure's mean, slope, and within-individual variability.
Simulation study: We assessed power and type I error in 12 scenarios that varied the instruments shared between the mean, slope, and variability, as well as the specification of the exposure regression model. Power to detect causal effects of the mean and slope was high throughout the simulation, but the variability effect was underpowered when SNPs were shared between the mean and the variability. Misspecified regression models lowered power and inflated the type I error.
Real data application: We applied our approach to two real data sets (POPS, UK Biobank). We detected significant causal estimates for both the mean and the slope in both cases, but no independent effect of the variability. However, only weak instruments were available in both data sets.
Conclusion: We used a new approach to test a time-varying exposure for causal effects of its mean, slope, and variability. The simulation with strong instruments seems promising, but it also highlights three crucial points: (1) the difficulty of defining the correct exposure regression model, (2) the dependence on the genetic correlation, and (3) the lack of strong instruments in real data. Taken together, this demands cautious evaluation of the results, accounting for known biology and the trajectory of the exposure.
{"title":"Mendelian Randomization With Longitudinal Exposure Data: Simulation Study and Real Data Application.","authors":"Janne Pott, Marco Palma, Yi Liu, Jasmine A Mack, Ulla Sovio, Gordon C S Smith, Jessica Barrett, Stephen Burgess","doi":"10.1002/sim.70378","DOIUrl":"10.1002/sim.70378","url":null,"abstract":"<p><strong>Background and aim: </strong>Mendelian randomization (MR) is a widely used tool to estimate causal effects using genetic variants as instrumental variables. MR is limited to cross-sectional summary statistics of different samples and time points to analyze time-varying effects. We aimed at using longitudinal summary statistics for an exposure in a multivariable MR setting and validating the effect estimates for the mean, slope, and within-individual variability.</p><p><strong>Simulation study: </strong>We tested our approach in 12 scenarios for power and type I error, depending on shared instruments between the mean, slope, and variability, and regression model specifications. We observed high power to detect causal effects of the mean and slope throughout the simulation, but the variability effect was low powered in the case of shared SNPs between the mean and variability. Mis-specified regression models led to lower power and increased the type I error.</p><p><strong>Real data application: </strong>We applied our approach to two real data sets (POPS, UK Biobank). We detected significant causal estimates for both the mean and the slope in both cases, but no independent effect of the variability. However, we only had weak instruments in both data sets.</p><p><strong>Conclusion: </strong>We used a new approach to test a time-varying exposure for causal effects of the exposure's mean, slope and variability. The simulation with strong instruments seems promising but also highlights three crucial points: (1) The difficulty to define the correct exposure regression model, (2) the dependency on the genetic correlation, and (3) the lack of strong instruments in real data. Taken together, this demands a cautious evaluation of the results, accounting for known biology and the trajectory of the exposure.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"45 1-2","pages":"e70378"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12824831/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146019756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}