Michael R Schwob, Mevin B Hooten, Vagheesh Narasimhan
Mechanistic statistical models are commonly used to study the flow of biological processes. For example, in landscape genetics, the aim is to infer spatial mechanisms that govern gene flow in populations. Existing statistical approaches in landscape genetics do not account for temporal dependence in the data and may be computationally prohibitive. We infer mechanisms with a Bayesian hierarchical dyadic model that scales well with large data sets and that accounts for spatial and temporal dependence. We construct a fully connected network comprising spatio-temporal data for the dyadic model and use normalized composite likelihoods to account for the dependence structure in space and time. We develop a dyadic model to account for physical mechanisms commonly found in physical-statistical models and apply our methods to ancient human DNA data to infer the mechanisms that affected human movement in Bronze Age Europe.
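As a toy illustration of the dyadic construction described above, the fully connected network consists of all unordered pairs of samples; the sample labels below are hypothetical, and the actual model attaches spatio-temporal covariates to each dyad.

```python
# Hypothetical sketch: enumerate the dyads of a fully connected network.
# Sample labels are illustrative; in the application each sample would carry
# spatial coordinates and a date (e.g., for ancient DNA samples).
from itertools import combinations

def build_dyads(samples):
    """Return all unordered pairs (dyads) from a list of samples."""
    return list(combinations(samples, 2))

samples = ["s1", "s2", "s3", "s4"]
dyads = build_dyads(samples)
# A fully connected network on n nodes has n*(n-1)/2 dyads.
assert len(dyads) == len(samples) * (len(samples) - 1) // 2
```

Because every pair appears, the number of dyads grows quadratically in the sample size, which is why the paper pairs this construction with composite likelihoods that scale well.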
{"title":"Composite dyadic models for spatio-temporal data.","authors":"Michael R Schwob, Mevin B Hooten, Vagheesh Narasimhan","doi":"10.1093/biomtc/ujae107","DOIUrl":"10.1093/biomtc/ujae107","url":null,"abstract":"<p><p>Mechanistic statistical models are commonly used to study the flow of biological processes. For example, in landscape genetics, the aim is to infer spatial mechanisms that govern gene flow in populations. Existing statistical approaches in landscape genetics do not account for temporal dependence in the data and may be computationally prohibitive. We infer mechanisms with a Bayesian hierarchical dyadic model that scales well with large data sets and that accounts for spatial and temporal dependence. We construct a fully connected network comprising spatio-temporal data for the dyadic model and use normalized composite likelihoods to account for the dependence structure in space and time. We develop a dyadic model to account for physical mechanisms commonly found in physical-statistical models and apply our methods to ancient human DNA data to infer the mechanisms that affected human movement in Bronze Age Europe.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142364260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The identification of surrogate markers is motivated by their potential to support earlier decisions about a treatment effect. However, few methods have been developed to actually use a surrogate marker to test for a treatment effect in a future study. Most existing methods consider combining surrogate marker and primary outcome information to test for a treatment effect, rely on fully parametric methods where strict parametric assumptions are made about the relationship between the surrogate and the outcome, and/or assume the surrogate marker is measured at only a single time point. Recent work has proposed a nonparametric test for a treatment effect using only surrogate marker information measured at a single time point by borrowing information learned from a prior study where both the surrogate and primary outcome were measured. In this paper, we utilize this nonparametric test and propose group sequential procedures that allow for early stopping of treatment effect testing in a setting where the surrogate marker is measured repeatedly over time. We derive the properties of the correlated surrogate-based nonparametric test statistics at multiple time points and compute stopping boundaries that allow for early stopping for a significant treatment effect, or for futility. We examine the performance of our proposed test using a simulation study and illustrate the method using data from two distinct AIDS clinical trials.
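For background on stopping boundaries, a standard device is an alpha-spending function. The sketch below uses the familiar Lan-DeMets O'Brien-Fleming-type spending function in one common two-sided form; it is purely illustrative and ignores the surrogate-based correlation structure that the paper derives.

```python
# Illustrative only: a Lan-DeMets O'Brien-Fleming-type alpha-spending
# function evaluated at equally spaced looks. This is generic group
# sequential machinery, not the paper's surrogate-specific boundaries.
from scipy.stats import norm

def obf_spending(t, alpha=0.05):
    """Cumulative two-sided type-I error spent by information fraction t."""
    return 2.0 * (1.0 - norm.cdf(norm.ppf(1.0 - alpha / 2.0) / t ** 0.5))

looks = [0.25, 0.5, 0.75, 1.0]
spent = [obf_spending(t) for t in looks]
# Spending increases with information and exhausts alpha at the final look,
# so very little error is spent at early interim analyses.
assert all(a < b for a, b in zip(spent, spent[1:]))
assert abs(spent[-1] - 0.05) < 1e-9
```

The conservative early boundaries implied by this function are what make early stopping for efficacy demanding at low information fractions.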
{"title":"Group sequential testing of a treatment effect using a surrogate marker.","authors":"Layla Parast, Jay Bartroff","doi":"10.1093/biomtc/ujae108","DOIUrl":"https://doi.org/10.1093/biomtc/ujae108","url":null,"abstract":"<p><p>The identification of surrogate markers is motivated by their potential to make decisions sooner about a treatment effect. However, few methods have been developed to actually use a surrogate marker to test for a treatment effect in a future study. Most existing methods consider combining surrogate marker and primary outcome information to test for a treatment effect, rely on fully parametric methods where strict parametric assumptions are made about the relationship between the surrogate and the outcome, and/or assume the surrogate marker is measured at only a single time point. Recent work has proposed a nonparametric test for a treatment effect using only surrogate marker information measured at a single time point by borrowing information learned from a prior study where both the surrogate and primary outcome were measured. In this paper, we utilize this nonparametric test and propose group sequential procedures that allow for early stopping of treatment effect testing in a setting where the surrogate marker is measured repeatedly over time. We derive the properties of the correlated surrogate-based nonparametric test statistics at multiple time points and compute stopping boundaries that allow for early stopping for a significant treatment effect, or for futility. 
We examine the performance of our proposed test using a simulation study and illustrate the method using data from two distinct AIDS clinical trials.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11459368/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142387635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A stepped wedge design is a unidirectional crossover design where clusters are randomized to distinct treatment sequences. While model-based analysis of stepped wedge designs is a standard practice to evaluate treatment effects accounting for clustering and adjusting for covariates, their properties under misspecification have not been systematically explored. In this article, we focus on model-based methods, including linear mixed models and generalized estimating equations with an independence, simple exchangeable, or nested exchangeable working correlation structure. We study when a potentially misspecified working model can offer consistent estimation of the marginal treatment effect estimands, which are defined nonparametrically with potential outcomes and may be functions of calendar time and/or exposure time. We prove a central result that consistency for nonparametric estimands usually requires a correctly specified treatment effect structure, but generally not the remaining aspects of the working model (functional form of covariates, random effects, and error distribution), and valid inference is obtained via the sandwich variance estimator. Furthermore, an additional g-computation step is required to achieve model-robust inference under non-identity link functions or for ratio estimands. The theoretical results are illustrated via several simulation experiments and re-analysis of a completed stepped wedge cluster randomized trial.
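The sandwich variance estimator mentioned above can be seen in its simplest form, an HC0 robust covariance for a linear working model with independent observations; this toy sketch omits the clustering used in stepped wedge analyses, but the bread/meat structure is the same.

```python
# Minimal sandwich (robust) variance sketch for OLS under heteroskedastic
# errors, where the model-based variance would be misspecified.
# Simulated data are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
# Error variance depends on the covariate: classical OLS SEs are wrong here.
y = X @ beta_true + rng.normal(size=n) * (1.0 + np.abs(X[:, 1]))

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * resid[:, None] ** 2)   # sum_i e_i^2 x_i x_i'
sandwich = bread @ meat @ bread          # HC0 covariance estimate

assert sandwich.shape == (2, 2)
assert np.all(np.diag(sandwich) > 0)
```

In the clustered (GEE) version, the meat sums cluster-level score contributions instead of individual squared residuals, which is what makes the inference robust to working-correlation misspecification.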
{"title":"How to achieve model-robust inference in stepped wedge trials with model-based methods?","authors":"Bingkai Wang, Xueqi Wang, Fan Li","doi":"10.1093/biomtc/ujae123","DOIUrl":"10.1093/biomtc/ujae123","url":null,"abstract":"<p><p>A stepped wedge design is an unidirectional crossover design where clusters are randomized to distinct treatment sequences. While model-based analysis of stepped wedge designs is a standard practice to evaluate treatment effects accounting for clustering and adjusting for covariates, their properties under misspecification have not been systematically explored. In this article, we focus on model-based methods, including linear mixed models and generalized estimating equations with an independence, simple exchangeable, or nested exchangeable working correlation structure. We study when a potentially misspecified working model can offer consistent estimation of the marginal treatment effect estimands, which are defined nonparametrically with potential outcomes and may be functions of calendar time and/or exposure time. We prove a central result that consistency for nonparametric estimands usually requires a correctly specified treatment effect structure, but generally not the remaining aspects of the working model (functional form of covariates, random effects, and error distribution), and valid inference is obtained via the sandwich variance estimator. Furthermore, an additional g-computation step is required to achieve model-robust inference under non-identity link functions or for ratio estimands. 
The theoretical results are illustrated via several simulation experiments and re-analysis of a completed stepped wedge cluster randomized trial.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11536888/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142581068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aaron J Molstad, Yanwei Cai, Alexander P Reiner, Charles Kooperberg, Wei Sun, Li Hsu
Ancestry-specific proteome-wide association studies (PWAS) based on genetically predicted protein expression can reveal complex disease etiology specific to certain ancestral groups. These studies require ancestry-specific models for protein expression as a function of SNP genotypes. In order to improve protein expression prediction in ancestral populations historically underrepresented in genomic studies, we propose a new penalized maximum likelihood estimator for fitting ancestry-specific joint protein quantitative trait loci models. Our estimator borrows information across ancestral groups, while simultaneously allowing for heterogeneous error variances and regression coefficients. We propose an alternative parameterization of our model that makes the objective function convex and the penalty scale invariant. To improve computational efficiency, we propose an approximate version of our method and study its theoretical properties. Our method provides a substantial improvement in protein expression prediction accuracy in individuals of African ancestry, and in a downstream PWAS analysis, leads to the discovery of multiple associations between protein expression and blood lipid traits in the African ancestry population.
{"title":"Heterogeneity-aware integrative regression for ancestry-specific association studies.","authors":"Aaron J Molstad, Yanwei Cai, Alexander P Reiner, Charles Kooperberg, Wei Sun, Li Hsu","doi":"10.1093/biomtc/ujae109","DOIUrl":"10.1093/biomtc/ujae109","url":null,"abstract":"<p><p>Ancestry-specific proteome-wide association studies (PWAS) based on genetically predicted protein expression can reveal complex disease etiology specific to certain ancestral groups. These studies require ancestry-specific models for protein expression as a function of SNP genotypes. In order to improve protein expression prediction in ancestral populations historically underrepresented in genomic studies, we propose a new penalized maximum likelihood estimator for fitting ancestry-specific joint protein quantitative trait loci models. Our estimator borrows information across ancestral groups, while simultaneously allowing for heterogeneous error variances and regression coefficients. We propose an alternative parameterization of our model that makes the objective function convex and the penalty scale invariant. To improve computational efficiency, we propose an approximate version of our method and study its theoretical properties. 
Our method provides a substantial improvement in protein expression prediction accuracy in individuals of African ancestry, and in a downstream PWAS analysis, leads to the discovery of multiple associations between protein expression and blood lipid traits in the African ancestry population.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11492996/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142457175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a robust alternative to the maximum likelihood estimator (MLE) for the polytomous logistic regression model, known as the family of minimum Rényi pseudodistance (RP) estimators. The proposed minimum RP estimators are parametrized by a tuning parameter $\alpha \ge 0$, and include the MLE as a special case when $\alpha = 0$. These estimators, along with a family of RP-based Wald-type tests, are shown to exhibit superior performance in the presence of misclassification errors. The paper includes an extensive simulation study and a real data example to illustrate the robustness of these proposed statistics.
{"title":"A new robust approach for the polytomous logistic regression model based on Rényi's pseudodistances.","authors":"Elena Castilla","doi":"10.1093/biomtc/ujae125","DOIUrl":"https://doi.org/10.1093/biomtc/ujae125","url":null,"abstract":"<p><p>This paper presents a robust alternative to the maximum likelihood estimator (MLE) for the polytomous logistic regression model, known as the family of minimum Rènyi Pseudodistance (RP) estimators. The proposed minimum RP estimators are parametrized by a tuning parameter $alpha ge 0$, and include the MLE as a special case when $alpha =0$. These estimators, along with a family of RP-based Wald-type tests, are shown to exhibit superior performance in the presence of misclassification errors. The paper includes an extensive simulation study and a real data example to illustrate the robustness of these proposed statistics.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142520910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Valancius, Herbert Pang, Jiawen Zhu, Stephen R Cole, Michele Jonsson Funk, Michael R Kosorok
We consider the challenges associated with causal inference in settings where data from a randomized trial are augmented with control data from an external source to improve efficiency in estimating the average treatment effect (ATE). This question is motivated by the SUNFISH trial, which investigated the effect of risdiplam on motor function in patients with spinal muscular atrophy. While the original analysis used only data generated by the trial, we explore an alternative analysis incorporating external controls from the placebo arm of a historical trial. We cast the setting into a formal causal inference framework and show how these designs are characterized by a lack of full randomization to treatment and heightened dependency on modeling. To address this, we outline sufficient causal assumptions about the exchangeability between the internal and external controls to identify the ATE and establish a connection with novel graphical criteria. Furthermore, we propose estimators, review efficiency bounds, develop an approach for efficient doubly robust estimation even when unknown nuisance models are estimated with flexible machine learning methods, suggest model diagnostics, and demonstrate finite-sample performance of the methods through a simulation study. The ideas and methods are illustrated through their application to the SUNFISH trial, where we find that external controls can increase the efficiency of treatment effect estimation.
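A minimal sketch of the doubly robust idea via augmented inverse probability weighting (AIPW): here the nuisance values are oracle quantities in a hypothetical simulation rather than machine-learned fits, so this illustrates only the generic estimator form, not the paper's hybrid-trial methodology.

```python
# Generic AIPW estimator of the average treatment effect (ATE).
# Simulated data and oracle nuisance values are hypothetical.
import numpy as np

def aipw_ate(y, a, ps, mu1, mu0):
    """Doubly robust ATE: mean of the AIPW influence-function contributions."""
    term1 = a * (y - mu1) / ps + mu1
    term0 = (1 - a) * (y - mu0) / (1 - ps) + mu0
    return np.mean(term1 - term0)

rng = np.random.default_rng(1)
n = 2000
a = rng.integers(0, 2, size=n)            # randomized assignment
y = 1.0 + 2.0 * a + rng.normal(size=n)    # true ATE = 2
# In a randomized trial the propensity score is known (0.5 here); the outcome
# means mu1 = 3 and mu0 = 1 are the true conditional means in this toy setup.
est = aipw_ate(y, a, ps=np.full(n, 0.5), mu1=np.full(n, 3.0), mu0=np.full(n, 1.0))
assert abs(est - 2.0) < 0.2
```

The estimator remains consistent if either the propensity model or the outcome model is correct, which is the property the hybrid-trial setting leans on when external controls are not randomized.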
{"title":"A causal inference framework for leveraging external controls in hybrid trials.","authors":"Michael Valancius, Herbert Pang, Jiawen Zhu, Stephen R Cole, Michele Jonsson Funk, Michael R Kosorok","doi":"10.1093/biomtc/ujae095","DOIUrl":"10.1093/biomtc/ujae095","url":null,"abstract":"<p><p>We consider the challenges associated with causal inference in settings where data from a randomized trial are augmented with control data from an external source to improve efficiency in estimating the average treatment effect (ATE). This question is motivated by the SUNFISH trial, which investigated the effect of risdiplam on motor function in patients with spinal muscular atrophy. While the original analysis used only data generated by the trial, we explore an alternative analysis incorporating external controls from the placebo arm of a historical trial. We cast the setting into a formal causal inference framework and show how these designs are characterized by a lack of full randomization to treatment and heightened dependency on modeling. To address this, we outline sufficient causal assumptions about the exchangeability between the internal and external controls to identify the ATE and establish a connection with novel graphical criteria. Furthermore, we propose estimators, review efficiency bounds, develop an approach for efficient doubly robust estimation even when unknown nuisance models are estimated with flexible machine learning methods, suggest model diagnostics, and demonstrate finite-sample performance of the methods through a simulation study. 
The ideas and methods are illustrated through their application to the SUNFISH trial, where we find that external controls can increase the efficiency of treatment effect estimation.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11546536/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142602843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fei Jiang, Ge Zhao, Rosa Rodriguez-Monguio, Yanyuan Ma
With the continual advance of modern technologies, it has become increasingly common for the number of collected confounders to exceed the number of subjects in a data set. However, matching-based methods for estimating causal treatment effects in their original forms are not capable of handling high-dimensional confounders, and their various modified versions lack statistical support and valid inference tools. In this article, we propose a new approach for estimating the causal treatment effect, defined as the difference of the restricted mean survival time (RMST) under different treatments, in a high-dimensional setting for survival data. We combine the factor model and sufficient dimension reduction techniques to construct a propensity score and a prognostic score. Based on these scores, we develop a kernel-based doubly robust estimator of the RMST difference. We demonstrate its link to matching and establish the consistency and asymptotic normality of the estimator. We illustrate our method by analyzing a dataset from a study aimed at comparing the effects of two alternative treatments on the RMST of patients with diffuse large B cell lymphoma.
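The restricted mean survival time is the area under the survival curve up to a horizon tau. A compact Kaplan-Meier-based sketch follows; the data and helper name are hypothetical, and none of the matching or high-dimensional adjustment from the paper is included.

```python
# Illustrative RMST: integrate the Kaplan-Meier step function on [0, tau].
# Assumes distinct event/censoring times for simplicity.
import numpy as np

def km_rmst(time, event, tau):
    """RMST = integral of the Kaplan-Meier survival curve S(t) over [0, tau]."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    s, prev_t, rmst = 1.0, 0.0, 0.0
    at_risk = len(time)
    for t, d in zip(time, event):
        t_clip = min(t, tau)
        rmst += s * (t_clip - prev_t)   # area of the current step
        prev_t = t_clip
        if t > tau:
            break
        if d:                            # survival drops only at event times
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    rmst += s * max(0.0, tau - prev_t)   # tail from the last time to tau
    return rmst

time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 1, 1])
# With no censoring and tau >= max(time), RMST equals the sample mean.
assert abs(km_rmst(time, event, tau=4.0) - 2.5) < 1e-12
```

The RMST difference between two treatment arms then has a direct interpretation as the expected survival time gained before tau.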
{"title":"Causal effect estimation in survival analysis with high dimensional confounders.","authors":"Fei Jiang, Ge Zhao, Rosa Rodriguez-Monguio, Yanyuan Ma","doi":"10.1093/biomtc/ujae110","DOIUrl":"https://doi.org/10.1093/biomtc/ujae110","url":null,"abstract":"<p><p>With the ever advancing of modern technologies, it has become increasingly common that the number of collected confounders exceeds the number of subjects in a data set. However, matching based methods for estimating causal treatment effect in their original forms are not capable of handling high-dimensional confounders, and their various modified versions lack statistical support and valid inference tools. In this article, we propose a new approach for estimating causal treatment effect, defined as the difference of the restricted mean survival time (RMST) under different treatments in high-dimensional setting for survival data. We combine the factor model and the sufficient dimension reduction techniques to construct propensity score and prognostic score. Based on these scores, we develop a kernel based doubly robust estimator of the RMST difference. We demonstrate its link to matching and establish the consistency and asymptotic normality of the estimator. 
We illustrate our method by analyzing a dataset from a study aimed at comparing the effects of two alternative treatments on the RMST of patients with diffuse large B cell lymphoma.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11472547/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142457172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stephanie M Wu, Matthew R Williams, Terrance D Savitsky, Briana J K Stephenson
Poor diet quality is a key modifiable risk factor for hypertension and disproportionately impacts low-income women. Analyzing diet-driven hypertensive outcomes in this demographic is challenging due to the complexity of dietary data and selection bias when the data come from surveys, a main data source for understanding diet-disease relationships in understudied populations. Supervised Bayesian model-based clustering methods summarize dietary data into latent patterns that holistically capture relationships among foods and a known health outcome but do not sufficiently account for complex survey design. This leads to biased estimation and inference and lack of generalizability of the patterns. To address this, we propose a supervised weighted overfitted latent class analysis (SWOLCA) based on a Bayesian pseudo-likelihood approach that integrates sampling weights into an exposure-outcome model for discrete data. Our model adjusts for stratification, clustering, and informative sampling, and handles modifying effects via interaction terms within a Markov chain Monte Carlo Gibbs sampling algorithm. Simulation studies confirm that the SWOLCA model exhibits good performance in terms of bias, precision, and coverage. Using data from the National Health and Nutrition Examination Survey (2015-2018), we demonstrate the utility of our model by characterizing dietary patterns associated with hypertensive outcomes among low-income women in the United States.
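The pseudo-likelihood device, raising each observation's likelihood contribution to its sampling weight, can be seen in a one-parameter Bernoulli toy example where the weighted "MLE" is simply the weighted mean; the weights below are hypothetical, and the paper embeds this device in a far richer latent class model.

```python
# Toy pseudo-likelihood: maximizing
#   sum_i w_i * [y_i log p + (1 - y_i) log(1 - p)]
# over p gives the weighted mean of y (set the derivative to zero).
import numpy as np

y = np.array([1, 1, 0, 0, 0], dtype=float)
w = np.array([3.0, 1.0, 1.0, 1.0, 1.0])   # hypothetical inverse-inclusion weights
p_hat = np.sum(w * y) / np.sum(w)
# Upweighting the first respondent pulls the estimate above the raw mean 2/5.
assert abs(p_hat - 4.0 / 7.0) < 1e-12
```

Weighting the likelihood this way targets the population that the survey design sampled from, which is what restores generalizability of the estimated patterns.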
{"title":"Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a supervised weighted overfitted latent class analysis.","authors":"Stephanie M Wu, Matthew R Williams, Terrance D Savitsky, Briana J K Stephenson","doi":"10.1093/biomtc/ujae122","DOIUrl":"10.1093/biomtc/ujae122","url":null,"abstract":"<p><p>Poor diet quality is a key modifiable risk factor for hypertension and disproportionately impacts low-income women. Analyzing diet-driven hypertensive outcomes in this demographic is challenging due to the complexity of dietary data and selection bias when the data come from surveys, a main data source for understanding diet-disease relationships in understudied populations. Supervised Bayesian model-based clustering methods summarize dietary data into latent patterns that holistically capture relationships among foods and a known health outcome but do not sufficiently account for complex survey design. This leads to biased estimation and inference and lack of generalizability of the patterns. To address this, we propose a supervised weighted overfitted latent class analysis (SWOLCA) based on a Bayesian pseudo-likelihood approach that integrates sampling weights into an exposure-outcome model for discrete data. Our model adjusts for stratification, clustering, and informative sampling, and handles modifying effects via interaction terms within a Markov chain Monte Carlo Gibbs sampling algorithm. Simulation studies confirm that the SWOLCA model exhibits good performance in terms of bias, precision, and coverage. 
Using data from the National Health and Nutrition Examination Survey (2015-2018), we demonstrate the utility of our model by characterizing dietary patterns associated with hypertensive outcomes among low-income women in the United States.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11518851/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142520912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The problem of modeling the relationship between univariate distributions and one or more explanatory variables has lately attracted increasing interest. Existing approaches proceed by substituting proxy estimated distributions for the typically unknown response distributions. These estimates are obtained from available data but are problematic when only few data are available for some of the distributions. Such situations are common in practice and cannot be addressed with currently available approaches, especially when one aims at density estimates. We show how this and other problems associated with density estimation, such as tuning parameter selection and bias issues, can be side-stepped when covariates are available. We also introduce a novel version of distribution-response regression that is based on empirical measures. By avoiding the preprocessing step of recovering complete individual response distributions, the proposed approach is applicable when the sample size available for each distribution varies and especially when it is small for some of the distributions but large for others. In this case, one can still obtain consistent distribution estimates even for distributions with only few data by gaining strength across the entire sample of distributions, while traditional approaches where distributions or densities are estimated individually fail, since sparsely sampled densities cannot be consistently estimated. The proposed model is demonstrated to outperform existing approaches through simulations and Environmental Influences on Child Health Outcomes data.
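Part of what makes the empirical-measure formulation tractable is that, in one dimension, Wasserstein distances between empirical measures reduce to comparing sorted samples (quantile coupling). A minimal sketch with toy arrays, assuming equal sample sizes:

```python
# For equal-size univariate samples, the 2-Wasserstein distance between the
# empirical measures is the root mean squared difference of sorted values.
import numpy as np

def wasserstein2_empirical(x, y):
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

x = np.array([0.0, 1.0, 2.0])
y = x + 0.5                      # a pure location shift
# For a location shift, W2 equals the size of the shift.
assert abs(wasserstein2_empirical(x, y) - 0.5) < 1e-12
```

Because this computation needs only the raw samples, no intermediate density estimation (with its tuning-parameter and bias issues) is required.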
{"title":"Wasserstein regression with empirical measures and density estimation for sparse data.","authors":"Yidong Zhou, Hans-Georg Müller","doi":"10.1093/biomtc/ujae127","DOIUrl":"https://doi.org/10.1093/biomtc/ujae127","url":null,"abstract":"<p><p>The problem of modeling the relationship between univariate distributions and one or more explanatory variables lately has found increasing interest. Existing approaches proceed by substituting proxy estimated distributions for the typically unknown response distributions. These estimates are obtained from available data but are problematic when for some of the distributions only few data are available. Such situations are common in practice and cannot be addressed with currently available approaches, especially when one aims at density estimates. We show how this and other problems associated with density estimation such as tuning parameter selection and bias issues can be side-stepped when covariates are available. We also introduce a novel version of distribution-response regression that is based on empirical measures. By avoiding the preprocessing step of recovering complete individual response distributions, the proposed approach is applicable when the sample size available for each distribution varies and especially when it is small for some of the distributions but large for others. In this case, one can still obtain consistent distribution estimates even for distributions with only few data by gaining strength across the entire sample of distributions, while traditional approaches where distributions or densities are estimated individually fail, since sparsely sampled densities cannot be consistently estimated. 
The proposed model is demonstrated to outperform existing approaches through simulations and Environmental Influences on Child Health Outcomes data.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142581081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eva Biswas, Andee Kaplan, Mark S Kaiser, Daniel J Nordman
Binary spatial observations arise in environmental and ecological studies, where Markov random field (MRF) models are often applied. Despite the prevalence and the long history of MRF models for spatial binary data, appropriate model diagnostics have remained an unresolved issue in practice. A complicating factor is that such models involve neighborhood specifications, which are difficult to assess for binary data. To address this, we propose a formal goodness-of-fit (GOF) test for diagnosing an MRF model for spatial binary values. The test statistic involves a type of conditional Moran's I based on the fitted conditional probabilities, which can detect departures in model form, including neighborhood structure. Numerical studies show that the GOF test can perform well in detecting deviations from a null model, with a focus on neighborhoods as a difficult issue. We illustrate the spatial test with an application to Besag's historical endive data as well as the breeding pattern of grasshopper sparrows across Iowa.
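As background for the conditional Moran's I in the test statistic, the classical (unconditional) Moran's I can be computed directly from a spatial weight matrix; the lattice and weights below are made up for illustration.

```python
# Classical Moran's I: positive for spatially clustered values, negative for
# alternating values. The 4-site chain graph here is a toy example.
import numpy as np

def morans_i(values, W):
    """Moran's I for an observation vector and a spatial weight matrix W."""
    z = values - values.mean()
    n = len(values)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# Chain graph on 4 sites: each site neighbors the adjacent sites.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
clustered = np.array([1.0, 1.0, 0.0, 0.0])    # like values sit together
alternating = np.array([1.0, 0.0, 1.0, 0.0])  # like values repel
assert morans_i(clustered, W) > 0
assert morans_i(alternating, W) < 0
```

The conditional variant in the paper replaces the raw values with departures from the fitted conditional probabilities, so that residual spatial structure signals lack of fit, including a misspecified neighborhood.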
{"title":"A formal goodness-of-fit test for spatial binary Markov random field models.","authors":"Eva Biswas, Andee Kaplan, Mark S Kaiser, Daniel J Nordman","doi":"10.1093/biomtc/ujae119","DOIUrl":"https://doi.org/10.1093/biomtc/ujae119","url":null,"abstract":"<p><p>Binary spatial observations arise in environmental and ecological studies, where Markov random field (MRF) models are often applied. Despite the prevalence and the long history of MRF models for spatial binary data, appropriate model diagnostics have remained an unresolved issue in practice. A complicating factor is that such models involve neighborhood specifications, which are difficult to assess for binary data. To address this, we propose a formal goodness-of-fit (GOF) test for diagnosing an MRF model for spatial binary values. The test statistic involves a type of conditional Moran's I based on the fitted conditional probabilities, which can detect departures in model form, including neighborhood structure. Numerical studies show that the GOF test can perform well in detecting deviations from a null model, with a focus on neighborhoods as a difficult issue. We illustrate the spatial test with an application to Besag's historical endive data as well as the breeding pattern of grasshopper sparrows across Iowa.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 4","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142494172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}