Wasserstein regression with empirical measures and density estimation for sparse data.
Yidong Zhou, Hans-Georg Müller
The problem of modeling the relationship between univariate distributions and one or more explanatory variables has lately attracted increasing interest. Existing approaches proceed by substituting proxy estimated distributions for the typically unknown response distributions. These estimates are obtained from available data but are problematic when only a few observations are available for some of the distributions. Such situations are common in practice and cannot be addressed with currently available approaches, especially when one aims at density estimates. We show how this and other problems associated with density estimation, such as tuning parameter selection and bias issues, can be side-stepped when covariates are available. We also introduce a novel version of distribution-response regression that is based on empirical measures. By avoiding the preprocessing step of recovering complete individual response distributions, the proposed approach is applicable when the sample size available for each distribution varies, and especially when it is small for some of the distributions but large for others. In this case, one can still obtain consistent distribution estimates even for distributions with only a few observations by borrowing strength across the entire sample of distributions, whereas traditional approaches that estimate distributions or densities individually fail, since sparsely sampled densities cannot be consistently estimated. The proposed model is demonstrated to outperform existing approaches through simulations and data from the Environmental Influences on Child Health Outcomes program.
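For intuition on why empirical measures suffice in one dimension, recall that the 2-Wasserstein distance between two distributions on the real line equals the L2 distance between their quantile functions, so empirical quantiles can be pooled directly across distributions. The sketch below is a hypothetical illustration, not the authors' estimator: it regresses empirical quantile functions linearly on a scalar covariate, letting sparsely sampled distributions borrow strength from the whole sample; the linear-in-covariate form and all settings are assumptions for the demo.

```python
import numpy as np

def empirical_quantiles(sample, probs):
    """Empirical quantile function of a 1-d sample on a probability grid."""
    return np.quantile(np.asarray(sample), probs)

def wasserstein2(sample_a, sample_b, n_grid=100):
    """2-Wasserstein distance between two empirical measures on the line,
    computed as the L2 distance between their quantile functions."""
    probs = (np.arange(n_grid) + 0.5) / n_grid
    qa = empirical_quantiles(sample_a, probs)
    qb = empirical_quantiles(sample_b, probs)
    return np.sqrt(np.mean((qa - qb) ** 2))

def fit_quantile_regression(samples, x, probs):
    """Pool across distributions: regress each empirical quantile on x
    (one least-squares fit per grid point)."""
    Q = np.stack([empirical_quantiles(s, probs) for s in samples])  # (n, m)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, Q, rcond=None)                    # (2, m)
    return beta

def predict_quantile_function(beta, x_new):
    """Predicted quantile function at a new covariate value; sorting
    enforces monotonicity so the prediction is a valid distribution."""
    return np.sort(np.array([1.0, x_new]) @ beta)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
# Sparse regime: some distributions contribute as few as 3 observations.
samples = [rng.normal(2 * xi, 1, size=rng.integers(3, 30)) for xi in x]
probs = (np.arange(100) + 0.5) / 100
beta = fit_quantile_regression(samples, x, probs)
print(predict_quantile_function(beta, 0.5)[[9, 49, 89]])  # a few predicted quantiles
```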
{"title":"Wasserstein regression with empirical measures and density estimation for sparse data.","authors":"Yidong Zhou, Hans-Georg Müller","doi":"10.1093/biomtc/ujae127","DOIUrl":"https://doi.org/10.1093/biomtc/ujae127","url":null,"abstract":"<p><p>The problem of modeling the relationship between univariate distributions and one or more explanatory variables lately has found increasing interest. Existing approaches proceed by substituting proxy estimated distributions for the typically unknown response distributions. These estimates are obtained from available data but are problematic when for some of the distributions only few data are available. Such situations are common in practice and cannot be addressed with currently available approaches, especially when one aims at density estimates. We show how this and other problems associated with density estimation such as tuning parameter selection and bias issues can be side-stepped when covariates are available. We also introduce a novel version of distribution-response regression that is based on empirical measures. By avoiding the preprocessing step of recovering complete individual response distributions, the proposed approach is applicable when the sample size available for each distribution varies and especially when it is small for some of the distributions but large for others. In this case, one can still obtain consistent distribution estimates even for distributions with only few data by gaining strength across the entire sample of distributions, while traditional approaches where distributions or densities are estimated individually fail, since sparsely sampled densities cannot be consistently estimated. The proposed model is demonstrated to outperform existing approaches through simulations and Environmental Influences on Child Health Outcomes data.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142581081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A formal goodness-of-fit test for spatial binary Markov random field models.
Eva Biswas, Andee Kaplan, Mark S Kaiser, Daniel J Nordman
Binary spatial observations arise in environmental and ecological studies, where Markov random field (MRF) models are often applied. Despite the prevalence and long history of MRF models for spatial binary data, appropriate model diagnostics have remained an unresolved issue in practice. A complicating factor is that such models involve neighborhood specifications, which are difficult to assess for binary data. To address this, we propose a formal goodness-of-fit (GOF) test for diagnosing an MRF model for spatial binary values. The test statistic involves a type of conditional Moran's I based on the fitted conditional probabilities, which can detect departures in model form, including the neighborhood structure. Numerical studies show that the GOF test performs well in detecting deviations from a null model, with particular attention to neighborhood misspecification as the difficult case. We illustrate the test with applications to Besag's historical endive data and to the breeding pattern of grasshopper sparrows across Iowa.
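To convey the flavor of the statistic, here is a hypothetical Monte Carlo sketch, assuming a 4-nearest-neighbor lattice, an autologistic model for data generation, and an independence null: standardized residuals from the fitted conditional probabilities are combined into a Moran's-I-type statistic and compared against replicates simulated under the null. The construction is a simplified stand-in, not the authors' exact test.

```python
import numpy as np

def lattice_neighbors(n):
    """4-nearest-neighbor adjacency matrix for an n x n lattice."""
    idx = np.arange(n * n).reshape(n, n)
    W = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n:
                    W[idx[i, j], idx[a, b]] = 1.0
    return W

def conditional_moran_i(y, p_hat, W):
    """Moran's-I-type statistic on standardized residuals (y - p_hat):
    large values indicate residual spatial dependence that the fitted
    conditional probabilities failed to capture."""
    r = (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))
    return (len(y) / W.sum()) * (r @ W @ r) / (r @ r)

def gibbs_autologistic(alpha, eta, W, n_sweeps, rng):
    """Simulate an autologistic MRF by Gibbs sampling; used here to build
    the reference distribution of the statistic under the null."""
    y = (rng.random(W.shape[0]) < 0.5).astype(float)
    for _ in range(n_sweeps):
        for s in range(len(y)):
            logit = alpha + eta * (W[s] @ y)
            y[s] = float(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
    return y

rng = np.random.default_rng(1)
W = lattice_neighbors(10)
y = gibbs_autologistic(0.0, 0.3, W, 200, rng)       # data with real dependence
p0 = np.full_like(y, y.mean())                       # null fit: no spatial structure
obs = conditional_moran_i(y, p0, W)
alpha0 = np.log(y.mean() / (1 - y.mean()))
# Small reference sample (19 replicates) to keep the sketch fast.
null = [conditional_moran_i(gibbs_autologistic(alpha0, 0.0, W, 50, rng), p0, W)
        for _ in range(19)]
p_value = (1 + sum(t >= obs for t in null)) / 20
print(f"statistic={obs:.3f}, Monte Carlo p={p_value:.2f}")
```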
{"title":"A formal goodness-of-fit test for spatial binary Markov random field models.","authors":"Eva Biswas, Andee Kaplan, Mark S Kaiser, Daniel J Nordman","doi":"10.1093/biomtc/ujae119","DOIUrl":"https://doi.org/10.1093/biomtc/ujae119","url":null,"abstract":"<p><p>Binary spatial observations arise in environmental and ecological studies, where Markov random field (MRF) models are often applied. Despite the prevalence and the long history of MRF models for spatial binary data, appropriate model diagnostics have remained an unresolved issue in practice. A complicating factor is that such models involve neighborhood specifications, which are difficult to assess for binary data. To address this, we propose a formal goodness-of-fit (GOF) test for diagnosing an MRF model for spatial binary values. The test statistic involves a type of conditional Moran's I based on the fitted conditional probabilities, which can detect departures in model form, including neighborhood structure. Numerical studies show that the GOF test can perform well in detecting deviations from a null model, with a focus on neighborhoods as a difficult issue. We illustrate the spatial test with an application to Besag's historical endive data as well as the breeding pattern of grasshopper sparrows across Iowa.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142494172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Case-crossover designs and overdispersion with application to air pollution epidemiology.
Samuel Perreault, Gracia Y Dong, Alex Stringer, Hwashin Shin, Patrick E Brown
Over the last three decades, case-crossover designs have found many applications in the health sciences, especially in air pollution epidemiology. They are typically used, in combination with partial likelihood techniques, to define a conditional logistic model for the responses, usually health outcomes, conditional on the exposures. Although conditional logistic models have been shown to be equivalent, in typical air pollution epidemiology setups, to specific instances of the well-known Poisson time series model, it is often claimed that they cannot accommodate overdispersion. This paper clarifies the relationship between case-crossover designs, the models that ensue from their use, and overdispersion. In particular, we propose to relax the assumption of independence between individuals traditionally made in case-crossover analyses, in order to introduce overdispersion explicitly into the conditional logistic model. As we show, the resulting overdispersed conditional logistic model coincides with the overdispersed conditional Poisson model, in the sense that their likelihoods are simple re-expressions of one another. We further provide the technical details of a Bayesian implementation of the proposed case-crossover model, which we use to demonstrate, by means of a large simulation study, that standard case-crossover models can lead to dramatically underestimated coverage probabilities, while the proposed models do not. We also perform an illustrative analysis of the association between air pollution and morbidity in Toronto, Canada, which shows that the proposed models are more robust than standard ones to outliers such as those associated with public holidays.
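The claimed equivalence can be seen concretely: conditioning a Poisson time series on the total count within a stratum yields a multinomial likelihood with cell probabilities proportional to exp(beta * x_t), which is precisely the form of the conditional logistic likelihood arising from a time-stratified case-crossover design. A small numerical sketch with a single simulated stratum and one exposure coefficient (all settings invented):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Daily event counts y_t and exposure x_t within one monthly stratum.
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = rng.poisson(np.exp(0.1 * x + 2.0))

def neg_conditional_poisson_ll(beta):
    """Poisson log-likelihood conditional on the stratum total: a
    multinomial with cell probabilities proportional to exp(beta * x_t).
    The time-stratified case-crossover conditional logistic likelihood,
    matching each event to all other days of its stratum, has exactly
    this form, which is why the two models coincide."""
    eta = beta * x
    return -(y @ eta - y.sum() * np.log(np.exp(eta).sum()))

res = minimize_scalar(neg_conditional_poisson_ll, bounds=(-2, 2), method="bounded")
print(f"beta_hat = {res.x:.3f}")  # close to the true value 0.1
```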
{"title":"Case-crossover designs and overdispersion with application to air pollution epidemiology.","authors":"Samuel Perreault, Gracia Y Dong, Alex Stringer, Hwashin Shin, Patrick E Brown","doi":"10.1093/biomtc/ujae117","DOIUrl":"https://doi.org/10.1093/biomtc/ujae117","url":null,"abstract":"<p><p>Over the last three decades, case-crossover designs have found many applications in health sciences, especially in air pollution epidemiology. They are typically used, in combination with partial likelihood techniques, to define a conditional logistic model for the responses, usually health outcomes, conditional on the exposures. Despite the fact that conditional logistic models have been shown equivalent, in typical air pollution epidemiology setups, to specific instances of the well-known Poisson time series model, it is often claimed that they cannot allow for overdispersion. This paper clarifies the relationship between case-crossover designs, the models that ensue from their use, and overdispersion. In particular, we propose to relax the assumption of independence between individuals traditionally made in case-crossover analyses, in order to explicitly introduce overdispersion in the conditional logistic model. As we show, the resulting overdispersed conditional logistic model coincides with the overdispersed, conditional Poisson model, in the sense that their likelihoods are simple re-expressions of one another. We further provide the technical details of a Bayesian implementation of the proposed case-crossover model, which we use to demonstrate, by means of a large simulation study, that standard case-crossover models can lead to dramatically underestimated coverage probabilities, while the proposed models do not. We also perform an illustrative analysis of the association between air pollution and morbidity in Toronto, Canada, which shows that the proposed models are more robust than standard ones to outliers such as those associated with public holidays.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142457171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hierarchical random effects state-space model for modeling brain activities from electroencephalogram data.
Xingche Guo, Bin Yang, Ji Meng Loh, Qinxia Wang, Yuanjia Wang
Mental disorders present challenges in diagnosis and treatment due to their complex and heterogeneous nature. The electroencephalogram (EEG) has shown promise as a source of potential biomarkers for these disorders. However, existing methods for analyzing EEG signals have limitations in addressing heterogeneity and in capturing complex patterns of brain activity between regions. This paper proposes a novel random effects state-space model (RESSM) for analyzing large-scale multi-channel resting-state EEG signals that accounts for the heterogeneity of brain connectivity across groups and individual subjects. We incorporate multi-level random effects for the temporal dynamical and spatial mapping matrices and address non-stationarity, so that brain connectivity patterns can vary over time. The model is fitted within a Bayesian hierarchical framework using a Gibbs sampler. Compared to previous mixed-effects state-space models, we directly model the high-dimensional random effects matrices of interest without structural constraints and tackle the challenge of identifiability. Through extensive simulation studies, we demonstrate that our approach yields valid estimation and inference. We apply RESSM to a multi-site clinical trial of major depressive disorder (MDD). Our analysis uncovers significant differences in resting-state brain temporal dynamics between MDD patients and healthy individuals. In addition, we show that the subject-level EEG features derived from RESSM have superior predictive value for the heterogeneous treatment effect compared to EEG frequency band power, suggesting the potential of EEG as a valuable biomarker for MDD.
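As background, the backbone of such an analysis is the linear-Gaussian state-space model and its Kalman filter; RESSM places multi-level random effects on the transition and spatial mapping matrices and fits the model by Gibbs sampling, none of which is reproduced here. A minimal, self-contained sketch of the backbone with invented dimensions (log-likelihood up to an additive constant):

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, mu0, P0):
    """Standard Kalman filter for x_t = A x_{t-1} + w_t, y_t = C x_t + v_t.
    Returns the Gaussian log-likelihood up to an additive constant."""
    d = len(mu0)
    mu, P = mu0, P0
    loglik = 0.0
    for t in range(len(y)):
        # Predict step.
        mu, P = A @ mu, A @ P @ A.T + Q
        # Update step.
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        innov = y[t] - C @ mu
        loglik += -0.5 * (innov @ np.linalg.solve(S, innov)
                          + np.log(np.linalg.det(S)))
        mu = mu + K @ innov
        P = (np.eye(d) - K @ C) @ P
    return loglik

rng = np.random.default_rng(3)
d, p, T = 2, 4, 200              # latent dimension, channels, time points
A = 0.9 * np.eye(d)              # temporal dynamics (fixed here; random in RESSM)
C = rng.normal(size=(p, d))      # spatial mapping (fixed here; random in RESSM)
x = np.zeros(d)
y = np.empty((T, p))
for t in range(T):
    x = A @ x + rng.normal(scale=0.1, size=d)
    y[t] = C @ x + rng.normal(scale=0.5, size=p)

print(kalman_filter(y, A, C, 0.01 * np.eye(d), 0.25 * np.eye(p),
                    np.zeros(d), np.eye(d)))
```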
{"title":"A hierarchical random effects state-space model for modeling brain activities from electroencephalogram data.","authors":"Xingche Guo, Bin Yang, Ji Meng Loh, Qinxia Wang, Yuanjia Wang","doi":"10.1093/biomtc/ujae130","DOIUrl":"10.1093/biomtc/ujae130","url":null,"abstract":"<p><p>Mental disorders present challenges in diagnosis and treatment due to their complex and heterogeneous nature. Electroencephalogram (EEG) has shown promise as a source of potential biomarkers for these disorders. However, existing methods for analyzing EEG signals have limitations in addressing heterogeneity and capturing complex brain activity patterns between regions. This paper proposes a novel random effects state-space model (RESSM) for analyzing large-scale multi-channel resting-state EEG signals, accounting for the heterogeneity of brain connectivities between groups and individual subjects. We incorporate multi-level random effects for temporal dynamical and spatial mapping matrices and address non-stationarity so that the brain connectivity patterns can vary over time. The model is fitted under a Bayesian hierarchical model framework coupled with a Gibbs sampler. Compared to previous mixed-effects state-space models, we directly model high-dimensional random effects matrices of interest without structural constraints and tackle the challenge of identifiability. Through extensive simulation studies, we demonstrate that our approach yields valid estimation and inference. We apply RESSM to a multi-site clinical trial of major depressive disorder (MDD). Our analysis uncovers significant differences in resting-state brain temporal dynamics among MDD patients compared to healthy individuals. In addition, we show the subject-level EEG features derived from RESSM exhibit a superior predictive value for the heterogeneous treatment effect compared to the EEG frequency band power, suggesting the potential of EEG as a valuable biomarker for MDD.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11540184/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142590082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An exploratory penalized regression to identify combined effects of temporal variables: application to agri-environmental issues.
Bénedicte Fontez, Patrice Loisel, Thierry Simonneau, Nadine Hilgert
The development of sensors is opening new avenues in several fields of activity. For agricultural crops, complex combinations of agri-environmental dynamics, such as soil and climate variables, are now commonly recorded. These new kinds of measurements are an opportunity to improve knowledge of the drivers of crop yield and crop quality at harvest. This calls for new statistical approaches that account for the combined variations of these dynamic variables, considered here as temporal variables. The objective of the paper is to estimate an interpretable model for the influence of two such combined temporal inputs on a scalar output. A Sparse and Structured Procedure to Identify Combined Effects of Formatted temporal Predictors, hereafter denoted SpiceFP, is proposed. The method is based on transforming both temporal variables into categorical variables by defining joint modalities, from which a collection of multiple regression models is derived. The regressors are the frequencies associated with the joint class intervals. The class intervals and the related regression coefficients are determined using a generalized fused lasso. SpiceFP is a generic and exploratory approach. Our simulations show that it is flexible enough to select the non-null, influential modalities of values. A motivating example on grape quality is presented.
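To illustrate how joint modalities become regressors, the sketch below discretizes two simulated temporal variables, computes for each observation the fraction of time spent in each joint class interval, and regresses a toy response on these frequencies. A plain lasso stands in for the generalized fused lasso (which would additionally fuse adjacent classes into contiguous intervals); all variable names, grids, and settings are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, T = 120, 200                           # observations, time points per curve
temp = rng.normal(20, 4, size=(n, T))     # temporal variable 1 (toy "temperature")
rad = rng.normal(500, 150, size=(n, T))   # temporal variable 2 (toy "radiation")

# Joint modalities: discretize both variables jointly and record, per
# observation, the fraction of time spent in each joint class interval.
edges_t = np.linspace(temp.min(), temp.max(), 9)   # 8 x 8 joint classes
edges_r = np.linspace(rad.min(), rad.max(), 9)

def joint_frequencies(u, v):
    H, _, _ = np.histogram2d(u, v, bins=[edges_t, edges_r])
    return (H / H.sum()).ravel()

X = np.stack([joint_frequencies(temp[i], rad[i]) for i in range(n)])

# Toy response: only hot, high-radiation conditions influence the output.
influential = np.outer(edges_t[:-1] > 24, edges_r[:-1] > 650).ravel()
y = X @ (20.0 * influential) + rng.normal(scale=0.05, size=n)

fit = Lasso(alpha=1e-4, max_iter=50_000).fit(X, y)
print(fit.coef_.reshape(8, 8).round(1))   # the influential block stands out
```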
{"title":"An exploratory penalized regression to identify combined effects of temporal variables-application to agri-environmental issues.","authors":"Bénedicte Fontez, Patrice Loisel, Thierry Simonneau, Nadine Hilgert","doi":"10.1093/biomtc/ujae134","DOIUrl":"https://doi.org/10.1093/biomtc/ujae134","url":null,"abstract":"<p><p>The development of sensors is opening new avenues in several fields of activity. Concerning agricultural crops, complex combinations of agri-environmental dynamics, such as soil and climate variables, are now commonly recorded. These new kinds of measurements are an opportunity to improve knowledge of the drivers of crop yield and crop quality at harvest. This involves renewing statistical approaches to account for the combined variations of these dynamic variables, here considered as temporal variables. The objective of the paper is to estimate an interpretable model to study the influence of the two combined inputs on a scalar output. A Sparse and Structured Procedure is proposed to Identify Combined Effects of Formatted temporal Predictors, hereafter denoted S piceFP. The method is based on the transformation of both temporal variables into categorical variables by defining joint modalities, from which a collection of multiple regression models is then derived. The regressors are the frequencies associated with joint class intervals. The class intervals and related regression coefficients are determined using a generalized fused lasso. S piceFP is a generic and exploratory approach. The simulations we performed show that it is flexible enough to select the non-null or influential modalities of values. A motivating example for grape quality is presented.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142692652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive randomization methods for sequential multiple assignment randomized trials (SMARTs) via Thompson sampling.
Peter Norwood, Marie Davidian, Eric Laber
Response-adaptive randomization (RAR) has been studied extensively in conventional, single-stage clinical trials, where it has been shown to yield ethical and statistical benefits, especially in trials with many treatment arms. However, RAR and its potential benefits are understudied in sequential multiple assignment randomized trials (SMARTs), which are the gold-standard trial design for evaluation of multi-stage treatment regimes. We propose a suite of RAR algorithms for SMARTs based on Thompson Sampling (TS), a widely used RAR method in single-stage trials in which treatment randomization probabilities are aligned with the estimated probability that the treatment is optimal. We focus on two common objectives in SMARTs: (1) comparison of the regimes embedded in the trial and (2) estimation of an optimal embedded regime. We develop valid post-study inferential procedures for treatment regimes under the proposed algorithms. This is nontrivial, as even in single-stage settings standard estimators of an average treatment effect can have nonnormal asymptotic behavior under RAR. Our algorithms are the first for RAR in multi-stage trials that account for non-standard limiting behavior due to RAR. Empirical studies based on real-world SMARTs show that TS can improve in-trial subject outcomes without sacrificing efficiency for post-trial comparisons.
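Thompson sampling itself is simple to state: randomize each participant to an arm with the posterior probability that the arm is optimal, typically implemented by drawing once from each arm's posterior and assigning the argmax. Below is a minimal single-stage, two-arm sketch with Beta-Bernoulli posteriors; the response rates and settings are made up, and the paper's algorithms extend this idea to posterior probabilities that embedded multi-stage regimes are optimal, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
true_p = np.array([0.35, 0.55])   # unknown response rates for the two arms
successes = np.ones(2)            # Beta(1, 1) priors on each arm
failures = np.ones(2)

assignments, outcomes = [], []
for _ in range(200):
    # Thompson sampling: draw one success probability from each posterior
    # and assign the arm with the largest draw, i.e. randomize with the
    # posterior probability that the arm is optimal.
    draws = rng.beta(successes, failures)
    arm = int(np.argmax(draws))
    y = float(rng.random() < true_p[arm])
    successes[arm] += y
    failures[arm] += 1 - y
    assignments.append(arm)
    outcomes.append(y)

print("share assigned to the better arm:", np.mean(np.array(assignments) == 1))
print("mean in-trial outcome:", np.mean(outcomes))
```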
{"title":"Adaptive randomization methods for sequential multiple assignment randomized trials (smarts) via thompson sampling.","authors":"Peter Norwood, Marie Davidian, Eric Laber","doi":"10.1093/biomtc/ujae152","DOIUrl":"10.1093/biomtc/ujae152","url":null,"abstract":"<p><p>Response-adaptive randomization (RAR) has been studied extensively in conventional, single-stage clinical trials, where it has been shown to yield ethical and statistical benefits, especially in trials with many treatment arms. However, RAR and its potential benefits are understudied in sequential multiple assignment randomized trials (SMARTs), which are the gold-standard trial design for evaluation of multi-stage treatment regimes. We propose a suite of RAR algorithms for SMARTs based on Thompson Sampling (TS), a widely used RAR method in single-stage trials in which treatment randomization probabilities are aligned with the estimated probability that the treatment is optimal. We focus on two common objectives in SMARTs: (1) comparison of the regimes embedded in the trial and (2) estimation of an optimal embedded regime. We develop valid post-study inferential procedures for treatment regimes under the proposed algorithms. This is nontrivial, as even in single-stage settings standard estimators of an average treatment effect can have nonnormal asymptotic behavior under RAR. Our algorithms are the first for RAR in multi-stage trials that account for non-standard limiting behavior due to RAR. Empirical studies based on real-world SMARTs show that TS can improve in-trial subject outcomes without sacrificing efficiency for post-trial comparisons.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11647911/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142827259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient joint model for high-dimensional longitudinal and survival data via generic association features.
Van Tuan Nguyen, Adeline Fermanian, Antoine Barbieri, Sarah Zohar, Anne-Sophie Jannot, Simon Bussy, Agathe Guilloux
This paper introduces a prognostic method called FLASH that addresses the problem of joint modeling of longitudinal data and censored durations when a large number of both longitudinal and time-independent features are available. In the literature, standard joint models are either of the shared random effect or joint latent class type. Combining ideas from both worlds and using appropriate regularization techniques, we define a new model with the ability to automatically identify significant prognostic longitudinal features in a high-dimensional context, which is of increasing importance in many areas such as personalized medicine or churn prediction. We develop an estimation methodology based on the expectation-maximization algorithm and provide an efficient implementation. The statistical performance of the method is demonstrated both in extensive Monte Carlo simulation studies and on publicly available medical datasets. Our method significantly outperforms the state-of-the-art joint models in terms of C-index in a so-called "real-time" prediction setting, with a computational speed that is orders of magnitude faster than competing methods. In addition, our model automatically identifies significant features that are relevant from a practical point of view, making it interpretable, which is of the greatest importance for a prognostic algorithm in healthcare.
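For a sense of the overall pipeline (summarize each longitudinal trajectory by a few association features, then relate those features to the censored event time), here is a simplified sketch. The feature choice (per-subject least-squares intercept and slope) and the plain gradient-descent Cox fit are illustrative stand-ins; FLASH uses richer generic features, regularization, and an EM algorithm, none of which appears here.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_obs = 150, 10

# Per-subject longitudinal marker on a common grid; the association
# features here are each subject's least-squares intercept and slope.
t_grid = np.linspace(0, 1, n_obs)
true_slope = rng.normal(size=n)
traj = true_slope[:, None] * t_grid + rng.normal(scale=0.3, size=(n, n_obs))
B = np.column_stack([np.ones(n_obs), t_grid])
feats = traj @ B @ np.linalg.inv(B.T @ B)      # (n, 2): intercept, slope

# Event times whose hazard increases with the slope feature, plus censoring.
u = rng.exponential(scale=np.exp(-feats[:, 1]))
c = rng.exponential(scale=2 * np.median(u), size=n)
time, event = np.minimum(u, c), (u <= c).astype(float)

def cox_neg_loglik_grad(beta, X, time, event):
    """Breslow partial likelihood (ignoring ties) and its gradient."""
    order = np.argsort(-time)                  # sort by decreasing time
    X, event = X[order], event[order]
    eta = X @ beta
    w = np.exp(eta)
    cumw = np.cumsum(w)                        # running risk-set sums
    cumwx = np.cumsum(w[:, None] * X, axis=0)
    ll = np.sum(event * (eta - np.log(cumw)))
    grad = (event[:, None] * (X - cumwx / cumw[:, None])).sum(axis=0)
    return -ll, -grad

beta = np.zeros(2)
for _ in range(1000):                          # plain gradient descent
    _, g = cox_neg_loglik_grad(beta, feats, time, event)
    beta -= 0.2 * g / n
print("fitted log-hazard ratios (intercept, slope features):", beta.round(2))
```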
{"title":"An efficient joint model for high dimensional longitudinal and survival data via generic association features.","authors":"Van Tuan Nguyen, Adeline Fermanian, Antoine Barbieri, Sarah Zohar, Anne-Sophie Jannot, Simon Bussy, Agathe Guilloux","doi":"10.1093/biomtc/ujae149","DOIUrl":"https://doi.org/10.1093/biomtc/ujae149","url":null,"abstract":"<p><p>This paper introduces a prognostic method called FLASH that addresses the problem of joint modeling of longitudinal data and censored durations when a large number of both longitudinal and time-independent features are available. In the literature, standard joint models are either of the shared random effect or joint latent class type. Combining ideas from both worlds and using appropriate regularization techniques, we define a new model with the ability to automatically identify significant prognostic longitudinal features in a high-dimensional context, which is of increasing importance in many areas such as personalized medicine or churn prediction. We develop an estimation methodology based on the expectation-maximization algorithm and provide an efficient implementation. The statistical performance of the method is demonstrated both in extensive Monte Carlo simulation studies and on publicly available medical datasets. Our method significantly outperforms the state-of-the-art joint models in terms of C-index in a so-called \"real-time\" prediction setting, with a computational speed that is orders of magnitude faster than competing methods. In addition, our model automatically identifies significant features that are relevant from a practical point of view, making it interpretable, which is of the greatest importance for a prognostic algorithm in healthcare.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142827261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Debiased high-dimensional regression calibration for errors-in-variables log-contrast models.
Huali Zhao, Tianying Wang
Motivated by the challenges of analyzing gut microbiome and metagenomic data, this work tackles the issue of measurement errors in high-dimensional regression models that involve compositional covariates. This paper marks a pioneering effort in conducting statistical inference on high-dimensional compositional data affected by measurement error or contamination. We introduce a calibration approach tailored to the linear log-contrast model. Under relatively lenient conditions on the sparsity level of the parameter, we establish the asymptotic normality of the estimator for inference. Numerical experiments and an application to a microbiome study demonstrate the efficacy of our high-dimensional calibration strategy in minimizing bias and achieving the expected coverage rates for confidence intervals. Moreover, the potential application of the proposed methodology extends well beyond compositional data, suggesting its adaptability to a wide range of research contexts.
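For context, the linear log-contrast model takes log-transformed compositions as covariates, with regression coefficients constrained to sum to zero so that the model is invariant to the compositional normalization. Below is a minimal sketch of fitting this model by constrained least squares on simulated, error-free compositions; it contains none of the paper's measurement-error calibration or debiasing, which are the actual contributions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 10

# Compositional covariates (rows on the simplex), e.g. relative abundances.
raw = rng.gamma(shape=2.0, size=(n, p))
comp = raw / raw.sum(axis=1, keepdims=True)
Z = np.log(comp)

beta_true = np.zeros(p)
beta_true[:2], beta_true[2:4] = 1.0, -1.0    # sums to zero, as required
y = Z @ beta_true + rng.normal(scale=0.5, size=n)

# Least squares under the zero-sum constraint sum(beta) = 0:
# project onto the constraint space {b : sum(b) = 0} and solve there.
ones = np.ones((p, 1))
B = np.eye(p) - ones @ ones.T / p            # projector onto the constraint space
g, *_ = np.linalg.lstsq(Z @ B, y, rcond=None)
beta_hat = B @ g                              # automatically satisfies the constraint
print(beta_hat.round(2), " sum =", beta_hat.sum().round(6))
```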
{"title":"Debiased high-dimensional regression calibration for errors-in-variables log-contrast models.","authors":"Huali Zhao, Tianying Wang","doi":"10.1093/biomtc/ujae153","DOIUrl":"https://doi.org/10.1093/biomtc/ujae153","url":null,"abstract":"<p><p>Motivated by the challenges in analyzing gut microbiome and metagenomic data, this work aims to tackle the issue of measurement errors in high-dimensional regression models that involve compositional covariates. This paper marks a pioneering effort in conducting statistical inference on high-dimensional compositional data affected by mismeasured or contaminated data. We introduce a calibration approach tailored for the linear log-contrast model. Under relatively lenient conditions regarding the sparsity level of the parameter, we have established the asymptotic normality of the estimator for inference. Numerical experiments and an application in microbiome study have demonstrated the efficacy of our high-dimensional calibration strategy in minimizing bias and achieving the expected coverage rates for confidence intervals. Moreover, the potential application of our proposed methodology extends well beyond compositional data, suggesting its adaptability for a wide range of research contexts.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142827275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-parametric sensitivity analysis for trials with irregular and informative assessment times.
Bonnie B Smith, Yujing Gao, Shu Yang, Ravi Varadhan, Andrea J Apter, Daniel O Scharfstein
Many trials are designed to collect outcomes at or around pre-specified times after randomization. If there is variability in the times when participants are actually assessed, this can pose a challenge to learning the effect of treatment, since not all participants have outcome assessments at the times of interest. Furthermore, observed outcome values may not be representative of all participants' outcomes at a given time. Methods have been developed that account for some types of such irregular and informative assessment times; however, since these methods rely on untestable assumptions, sensitivity analyses are needed. We develop a sensitivity analysis methodology that is benchmarked at the explainable assessment (EA) assumption, under which assessment and outcomes at each time are related only through data collected prior to that time. Our method uses an exponential tilting assumption, governed by a sensitivity analysis parameter, that posits deviations from the EA assumption. Our inferential strategy is based on a new influence function-based, augmented inverse intensity-weighted estimator. Our approach allows for flexible semiparametric modeling of the observed data, which is separated from specification of the sensitivity parameter. We apply our method to a randomized trial of low-income individuals with uncontrolled asthma, and we illustrate implementation of our estimation procedure in detail.
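To make the exponential-tilting idea concrete, here is a toy, simplified version for a single assessment time, treating assessment as outcome-dependent missingness: the log-odds of being assessed are shifted by a sensitivity parameter gamma times the outcome, observed outcomes are reweighted by the inverse of the posited assessment probability, and gamma is varied over a grid, with gamma = 0 corresponding to the explainable-assessment benchmark. For simplicity the sketch uses the true assessment model rather than an estimated intensity, and it omits the paper's augmentation and influence-function machinery.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2000
x = rng.normal(size=n)                    # history available for everyone
y = x + rng.normal(size=n)                # outcome at the time of interest
gamma_true = 0.8                          # assessment depends on the outcome itself
logit_pi = -0.5 + 0.5 * x                 # explainable part of the assessment model
p_assess = 1.0 / (1.0 + np.exp(-(logit_pi + gamma_true * y)))
r = (rng.random(n) < p_assess).astype(float)

def tilted_ipw_mean(gamma):
    """Weighted mean of observed outcomes under the exponential-tilting
    assumption indexed by gamma (gamma = 0 is the explainable-assessment
    benchmark). Weights invert the posited assessment probability; the
    weight is only ever used for assessed subjects (r = 1)."""
    w = 1.0 + np.exp(-(logit_pi + gamma * y))   # 1 / P(assessed | x, y)
    return np.sum(r * w * y) / np.sum(r * w)

for g in [0.0, 0.4, 0.8, 1.2]:
    print(f"gamma={g:.1f}: mu_hat={tilted_ipw_mean(g):.3f}")
# The true mean of y is 0; the estimate recovers it near gamma = gamma_true.
```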
{"title":"Semi-parametric sensitivity analysis for trials with irregular and informative assessment times.","authors":"Bonnie B Smith, Yujing Gao, Shu Yang, Ravi Varadhan, Andrea J Apter, Daniel O Scharfstein","doi":"10.1093/biomtc/ujae154","DOIUrl":"10.1093/biomtc/ujae154","url":null,"abstract":"<p><p>Many trials are designed to collect outcomes at or around pre-specified times after randomization. If there is variability in the times when participants are actually assessed, this can pose a challenge to learning the effect of treatment, since not all participants have outcome assessments at the times of interest. Furthermore, observed outcome values may not be representative of all participants' outcomes at a given time. Methods have been developed that account for some types of such irregular and informative assessment times; however, since these methods rely on untestable assumptions, sensitivity analyses are needed. We develop a sensitivity analysis methodology that is benchmarked at the explainable assessment (EA) assumption, under which assessment and outcomes at each time are related only through data collected prior to that time. Our method uses an exponential tilting assumption, governed by a sensitivity analysis parameter, that posits deviations from the EA assumption. Our inferential strategy is based on a new influence function-based, augmented inverse intensity-weighted estimator. Our approach allows for flexible semiparametric modeling of the observed data, which is separated from specification of the sensitivity parameter. We apply our method to a randomized trial of low-income individuals with uncontrolled asthma, and we illustrate implementation of our estimation procedure in detail.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669851/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142891794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}