Biometrics: Latest Articles

Causal machine learning for heterogeneous treatment effects in the presence of missing outcome data.
IF 1.7 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf098
Matthew Pryce, Karla Diaz-Ordaz, Ruth H Keogh, Stijn Vansteelandt

When estimating heterogeneous treatment effects, missing outcome data can complicate estimation and leave certain subgroups of the population poorly represented. In this work, we discuss this commonly overlooked problem and consider the impact that missing-at-random outcome data has on causal machine learning estimators for the conditional average treatment effect (CATE). We propose two de-biased machine learning estimators for the CATE, the mDR-learner and the mEP-learner, which address under-representation by integrating inverse probability of censoring weights into the DR-learner and EP-learner, respectively. We show that, under reasonable conditions, these estimators are oracle efficient, and we illustrate their favorable performance in simulated data settings, comparing them to existing CATE estimators, including estimators that use common missing data techniques. We present an example application using the GBSG2 trial, exploring treatment effect heterogeneity when comparing hormonal therapies to non-hormonal therapies among breast cancer patients after surgery, and offer guidance on the decisions a practitioner must make when implementing these estimators.
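To make the idea of combining censoring weights with a DR-learner concrete, the sketch below computes doubly robust pseudo-outcomes in which the residual term is additionally weighted by the inverse probability that the outcome is observed. This is an assumed, textbook-style form and not the authors' implementation; the function name and all argument names are hypothetical, and the nuisance estimates (`pi`, `g`, `mu0`, `mu1`) are taken as already fitted.

```python
from typing import List, Optional, Sequence


def ipcw_dr_pseudo_outcomes(
    y: Sequence[Optional[float]],  # outcome, None when missing
    a: Sequence[int],              # treatment indicator (0/1)
    c: Sequence[int],              # 1 if the outcome is observed, 0 if missing
    pi: Sequence[float],           # estimated P(A=1 | X)
    g: Sequence[float],            # estimated P(C=1 | X, A)
    mu0: Sequence[float],          # estimated E[Y | X, A=0]
    mu1: Sequence[float],          # estimated E[Y | X, A=1]
) -> List[float]:
    """Illustrative IPCW-weighted DR pseudo-outcomes (assumed form, not the paper's code)."""
    pseudo = []
    for yi, ai, ci, pii, gi, m0, m1 in zip(y, a, c, pi, g, mu0, mu1):
        mu_a = m1 if ai == 1 else m0
        # Units with a missing outcome contribute no residual term.
        resid = (yi - mu_a) if (ci == 1 and yi is not None) else 0.0
        # Propensity-based weight times the inverse probability of being observed.
        weight = (ai - pii) / (pii * (1.0 - pii)) * (ci / gi)
        pseudo.append(weight * resid + (m1 - m0))
    return pseudo
```

In a DR-learner-style second stage, these pseudo-outcomes would then be regressed on the covariates to estimate the CATE; the choice of second-stage learner is left open here.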

Citations: 0
Exploring the heterogeneity in recurrent episode lengths based on quantile regression.
IF 1.7 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf122
Yi Liu, Guillermo E Umpierrez, Limin Peng

Recurrent episode data frequently arise in chronic disease studies when an event of interest occurs repeatedly and each occurrence lasts for a random period of time. Understanding the heterogeneity in recurrent episode lengths can help guide dynamic and customized disease management. However, methods tailored to this end have received relatively little attention. Existing approaches either do not confer direct interpretation on episode lengths or involve restrictive or unrealistic distributional assumptions, such as exchangeability of within-individual episode lengths. In this work, we propose a modeling strategy that overcomes these limitations by adopting quantile regression and sensibly incorporating time-dependent covariates. Treating recurrent episodes as clustered data, we develop an estimation procedure that properly handles the special complications, including dependent censoring, dependent truncation, and informative cluster size. Our estimation procedure is computationally simple and yields estimators with desirable asymptotic properties. Our numerical studies demonstrate the advantages of the proposed method over naive adaptations of existing approaches.
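For readers less familiar with quantile regression, the following sketch shows only its basic building block, the check loss, and verifies numerically that minimizing it over a constant recovers a sample quantile. It deliberately ignores censoring, truncation, clustering, and covariates, all of which the paper's procedure handles; the function names are hypothetical.

```python
from typing import Sequence


def check_loss(u: float, tau: float) -> float:
    """Quantile (check) loss: tau * u if u >= 0, else (tau - 1) * u."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def best_constant_quantile(values: Sequence[float], tau: float) -> float:
    """Return the data point that minimizes the total check loss.

    Ties are resolved in favor of the smallest candidate.
    """
    def total_loss(q: float) -> float:
        return sum(check_loss(v - q, tau) for v in values)

    return min(sorted(values), key=total_loss)
```

With episode lengths `[1, 2, 3, 4, 10]`, the minimizer at `tau=0.5` is the median, while a high quantile level such as `tau=0.9` is pulled toward the long episode, which is why quantile levels give a direct reading of heterogeneity in episode lengths.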

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448847/pdf/
Citations: 0
Adjusted predictions for generalized estimating equations.
IF 1.4 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf090
Francis K C Hui, Samuel Muller, Alan H Welsh

Generalized estimating equations (GEEs) are a popular statistical method for longitudinal data analysis, requiring specification of the first two marginal moments of the response along with a working correlation matrix to capture temporal correlations within a cluster. When it comes to prediction at future/new time points using GEEs, a standard approach adopted by practitioners and software is to base it simply on the marginal mean model. In this article, we propose an alternative approach to prediction for independent cluster GEEs. By viewing the GEE as solving an iterative working linear model, we borrow ideas from universal kriging to construct an adjusted predictor that exploits working cross-correlations between the current and new observations within the same cluster. We establish theoretical conditions for the adjusted GEE predictor to outperform the standard GEE predictor. Simulations and an application to longitudinal data on the growth of Sitka spruces demonstrate that, even when we misspecify the working correlation structure, adjusted GEE predictors can achieve better performance relative to standard GEE predictors, the so-called "oracle" GEE predictor using all time points, and potentially even cluster-specific predictions from a generalized linear mixed model.
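The kriging idea can be illustrated in the simplest setting: an identity link, constant variance, and an AR(1) working correlation. Under those assumptions (which are ours, chosen for illustration, and narrower than the paper's general construction), the adjusted prediction adds a correlation-weighted combination of the cluster's current residuals to the marginal mean. All names below are hypothetical.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def adjusted_prediction(times, y, mu, t_new, mu_new, rho):
    """Marginal mean at t_new plus r' R^{-1} (y - mu) under an AR(1) working correlation."""
    big_r = [[rho ** abs(s - t) for t in times] for s in times]  # within-cluster correlation
    small_r = [rho ** abs(t_new - t) for t in times]             # cross-correlation with t_new
    resid = [yi - mi for yi, mi in zip(y, mu)]
    weights = solve_linear_system(big_r, resid)                  # R^{-1} (y - mu)
    return mu_new + sum(ri * wi for ri, wi in zip(small_r, weights))
```

For AR(1) and a new time point one step after the last observation, the adjustment reduces to `rho` times the last residual, and with `rho = 0` the function returns the standard marginal-mean prediction.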

Citations: 0
A group distributional ICA method for decomposing multi-subject diffusion tensor imaging.
IF 1.7 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf117
Guangming Yang, Ben Wu, Jian Kang, Ying Guo

Diffusion tensor imaging (DTI) is a frequently used imaging modality to investigate white matter fiber connections of the human brain. DTI provides an important tool for characterizing human brain structural organization. Common goals in DTI analysis include dimension reduction, denoising, and extraction of underlying structure networks. Blind source separation methods are often used to achieve these goals for other imaging modalities. However, there has been very limited work on multi-subject DTI data. Due to the special characteristics of the 3D diffusion tensor measured in DTI, existing methods such as standard independent component analysis (ICA) cannot be directly applied. We propose a Group Distributional ICA (G-DICA) method to fill this gap. G-DICA represents a fundamentally new blind source separation method that separates the parameters in the distribution function of the observed imaging data as a mixture of independent source signals. Decomposing multi-subject DTI using G-DICA uncovers structural networks corresponding to several major white matter fiber bundles in the brain. Through simulation studies and real data applications, the proposed G-DICA method demonstrates superior performance and improved reproducibility compared to the existing method.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448322/pdf/
Citations: 0
A flexible framework for N-mixture occupancy models: applications to breeding bird surveys.
IF 1.4 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf087
Huu-Dinh Huynh, J Andrew Royle, Wen-Han Hwang

Estimating species abundance under imperfect detection is a key challenge in biodiversity conservation. The N-mixture model, widely recognized for its ability to distinguish between abundance and individual detection probability without marking individuals, is constrained by its stringent closure assumption, which leads to biased estimates when violated in real-world settings. To address this limitation, we propose an extended framework based on a development of the mixed Gamma-Poisson model, incorporating a community parameter that represents the proportion of individuals consistently present throughout the survey period. This flexible framework generalizes both the zero-inflated type occupancy model and the standard N-mixture model as special cases, corresponding to community parameter values of 0 and 1, respectively. The model's effectiveness is validated through simulations and applications to real-world datasets, specifically with 5 species from the North American Breeding Bird Survey and 46 species from the Swiss Breeding Bird Survey, demonstrating its improved accuracy and adaptability in settings where strict closure may not hold.
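As background, the standard N-mixture model under closure can be written down in a few lines: the latent abundance at a site is Poisson, each repeated count is binomial given that abundance, and the site likelihood sums over the unobserved abundance. The sketch below covers only this standard special case, not the extended framework with a community parameter; the function names and the truncation bound `n_max` are our own assumptions.

```python
import math
from typing import Sequence


def poisson_pmf(n: int, lam: float) -> float:
    """Poisson probability mass, computed on the log scale for stability."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))


def binomial_pmf(y: int, n: int, p: float) -> float:
    """Binomial probability of detecting y out of n individuals."""
    return math.comb(n, y) * p**y * (1.0 - p) ** (n - y)


def n_mixture_site_likelihood(
    counts: Sequence[int], lam: float, p: float, n_max: int = 200
) -> float:
    """Likelihood of repeated counts at one site, marginalizing over abundance N.

    Assumes closure: the same N individuals are available on every visit.
    The infinite sum over N is truncated at n_max.
    """
    total = 0.0
    for n in range(max(counts), n_max + 1):
        detect = 1.0
        for y in counts:
            detect *= binomial_pmf(y, n, p)
        total += poisson_pmf(n, lam) * detect
    return total
```

A quick sanity check is the single-visit case, where the marginal count distribution is Poisson with mean `lam * p`; this also illustrates why repeated visits are needed to separate abundance from detection probability.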

Citations: 0
Correction to "Propensity weighting plus adjustment in proportional hazards model is not doubly robust," by Erin E. Gabriel, Michael C. Sachs, Ingeborg Waernbaum, Els Goetghebeur, Paul F. Blanche, Stijn Vansteelandt, Arvid Sjölander, and Thomas Scheike; Volume 80, Issue 3, September 2024, https://doi.org/10.1093/biomtc/ujae069.
IF 1.4 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf091
Erin E Gabriel, Michael C Sachs, Ingeborg Waernbaum, Els Goetghebeur, Paul F Blanche, Stijn Vansteelandt, Arvid Sjölander, Thomas Scheike
Citations: 0
Using model-assisted calibration methods to improve efficiency of regression analyses using two-phase samples or pooled samples under complex survey designs.
IF 1.7 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf092
Lingxiao Wang

Two-phase sampling designs are frequently applied in epidemiological studies and large-scale health surveys. In such designs, certain variables are collected exclusively within a second-phase random subsample of the initial first-phase sample, often due to factors such as high costs, response burden, or constraints on data collection or assessment. Consequently, second-phase sample estimators can be inefficient due to the diminished sample size. Model-assisted calibration methods have been used to improve the efficiency of second-phase estimators in regression analysis. However, limited literature provides valid finite population inferences of the calibration estimators that use appropriate calibration auxiliary variables while simultaneously accounting for the complex sample designs in the first- and second-phase samples. Moreover, no literature considers the "pooled design" where some covariates are measured exclusively in certain repeated survey cycles. This paper proposes calibrating the sample weights for the second-phase sample to the weighted first-phase sample based on score functions of the regression model that uses predictions of the second-phase variable for the first-phase sample. We establish the consistency of estimation using calibrated weights and provide variance estimation for the regression coefficients under the two-phase design or the pooled design nested within complex survey designs. Empirical evidence highlights the efficiency and robustness of the proposed calibration compared to existing calibration and imputation methods. Data examples from the National Health and Nutrition Examination Survey are provided.
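Weight calibration itself is a generic operation, and a small example may help. The sketch below performs classical linear calibration of second-phase design weights so that the weighted totals of two auxiliaries, a constant and one variable `z`, match given first-phase totals. In the paper the auxiliaries are score functions of a regression model; here `z` is a generic stand-in, and the function name and arguments are hypothetical.

```python
from typing import List, Sequence


def linear_calibration(
    d: Sequence[float],   # second-phase design weights
    z: Sequence[float],   # auxiliary variable observed in the second phase
    total_n: float,       # target total for the constant auxiliary
    total_z: float,       # target total for z (e.g. weighted first-phase total)
) -> List[float]:
    """Calibrated weights w_i = d_i * (1 + lam1 + lam2 * z_i).

    lam solves the 2x2 system (sum d_i x_i x_i') lam = targets - sum d_i x_i,
    with x_i = (1, z_i), so the calibrated totals hit the targets exactly.
    """
    s11 = sum(d)
    s12 = sum(di * zi for di, zi in zip(d, z))
    s22 = sum(di * zi * zi for di, zi in zip(d, z))
    gap1 = total_n - s11
    gap2 = total_z - s12
    det = s11 * s22 - s12 * s12
    lam1 = (s22 * gap1 - s12 * gap2) / det
    lam2 = (-s12 * gap1 + s11 * gap2) / det
    return [di * (1.0 + lam1 + lam2 * zi) for di, zi in zip(d, z)]
```

For example, with four units of weight 10, `z = [1, 2, 3, 4]`, and targets 50 and 140, the calibrated weights are 8, 11, 14, and 17. Efficiency gains come from choosing auxiliaries that are strongly related to the quantity being estimated.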

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12288669/pdf/
Citations: 0
Mastering rare event analysis: subsample-size determination in Cox and logistic regressions.
IF 1.7 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf110
Tal Agassi, Nir Keret, Malka Gorfine

In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the subsample size. To bridge this gap, our work introduces tools designed for choosing the subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. Additionally, we present a new optimal subsampling procedure tailored to logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets: survival analysis of UK Biobank colorectal cancer data with about 350 million rows and logistic regression of linked birth and infant death data with about 28 million observations.
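To fix ideas about what a subsample looks like in a rare-event setting, the sketch below keeps every event row and draws a fixed number of non-event rows uniformly at random, attaching inverse-probability weights so that weighted analyses still represent the full data. Uniform sampling is used purely for simplicity and is not the optimal scheme the abstract refers to, nor does the sketch choose the subsample size; the function name and arguments are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def subsample_rare_events(
    events: Sequence[int], n_controls: int, seed: int = 0
) -> List[Tuple[int, float]]:
    """Return (row_index, weight) pairs for a rare-event subsample.

    All event rows (events[i] == 1) are kept with weight 1. Non-event rows are
    sampled uniformly without replacement and weighted by the inverse of their
    sampling fraction.
    """
    rng = random.Random(seed)
    event_rows = [i for i, e in enumerate(events) if e == 1]
    control_rows = [i for i, e in enumerate(events) if e == 0]
    n_controls = min(n_controls, len(control_rows))
    sampled = rng.sample(control_rows, n_controls)
    weight = len(control_rows) / n_controls if n_controls else 0.0
    return [(i, 1.0) for i in event_rows] + [(i, weight) for i in sorted(sampled)]
```

The number of sampled non-events, `n_controls`, is exactly the tuning quantity whose selection the paper addresses: larger values cost more computation and lose less efficiency.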

Citations: 0
Cumulative incidence function estimation using population-based biobank data.
IF 1.7 | Zone 4, Mathematics | Q3 BIOLOGY | Pub Date: 2025-07-03 | DOI: 10.1093/biomtc/ujaf049
Malka Gorfine, David M Zucker, Shoval Shoham

Many countries have established population-based biobanks, which are being used increasingly in epidemiological and clinical research. These biobanks offer opportunities for large-scale studies addressing questions beyond the scope of traditional clinical trials or cohort studies. However, using biobank data poses new challenges. Typically, biobank data are collected from a study cohort recruited over a defined calendar period, with subjects entering the study at various ages falling between $c_L$ and $c_U$. This work focuses on biobank data with individuals reporting disease-onset age upon recruitment, termed prevalent data, along with individuals initially recruited as healthy, and their disease onset observed during the follow-up period. We propose a novel cumulative incidence function (CIF) estimator that efficiently incorporates prevalent cases, in contrast to existing methods, providing two advantages: (1) increased efficiency and (2) CIF estimation for ages before the lower limit, $c_L$.

引用次数: 0
Statistical significance of clustering for count data.
IF 1.7 Zone 4 (Mathematics) Q3 BIOLOGY Pub Date: 2025-07-03 DOI: 10.1093/biomtc/ujaf120
Yifan Dai, Di Wu, Yufeng Liu

Clustering is widely used in biomedical research for meaningful subgroup identification. However, most existing clustering algorithms do not account for the statistical uncertainty of the resulting clusters and consequently may generate spurious clusters due to natural sampling variation. To address this problem, the Statistical Significance of Clustering (SigClust) method was developed to evaluate the significance of clusters in high-dimensional data. While SigClust has been successful in assessing clustering significance for continuous data, it is not specifically designed for discrete data, such as count data in genomics. Moreover, SigClust and its variations can suffer from reduced statistical power when applied to non-Gaussian high-dimensional data. To overcome these limitations, we propose SigClust-DEV, a method designed to evaluate the significance of clusters in count data. Through extensive simulations, we compare SigClust-DEV against other existing SigClust approaches across various count distributions and demonstrate its superior performance. Furthermore, we apply our proposed SigClust-DEV to Hydra single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) of cancer patients to identify meaningful latent cell types and patient subgroups, respectively.
