Yonatan Woodbridge, Micha Mandel, Yair Goldberg, Amit Huppert
Viral load (VL) in the respiratory tract is the leading proxy for assessing infectiousness potential. Understanding the dynamics of disease-related VL within the host is of great importance, as it helps inform policies and health recommendations. However, VL is normally measured only once per individual, in order to confirm infection, and furthermore, the infection date is unknown. It is therefore necessary to develop statistical approaches to estimate the typical VL trajectory. We show here that, under plausible parametric assumptions, two VL measurements per infected individual can be used to accurately estimate the VL mean function. Specifically, we consider a discrete-time likelihood-based approach to modeling and estimating partially observed longitudinal samples. We study a multivariate normal model for a function of the VL that accounts for possible correlation between measurements within individuals. We derive an expectation-maximization (EM) algorithm which treats the unknown time origins and the missing measurements as latent variables. Our main motivation is the reconstruction of the daily mean VL, given measurements on patients whose VLs were measured multiple times on different days. Such data can and should be obtained at the beginning of a pandemic with the specific goal of estimating the VL dynamics. For demonstration purposes, the method is applied to SARS-CoV-2 cycle-threshold-value data collected in Israel.
{"title":"Estimating Mean Viral Load Trajectory From Intermittent Longitudinal Data and Unknown Time Origins.","authors":"Yonatan Woodbridge, Micha Mandel, Yair Goldberg, Amit Huppert","doi":"10.1002/sim.70033","DOIUrl":"10.1002/sim.70033","url":null,"abstract":"<p><p>Viral load (VL) in the respiratory tract is the leading proxy for assessing infectiousness potential. Understanding the dynamics of disease-related VL within the host is of great importance, as it helps to determine different policies and health recommendations. However, normally the VL is measured on individuals only once, in order to confirm infection, and furthermore, the infection date is unknown. It is therefore necessary to develop statistical approaches to estimate the typical VL trajectory. We show here that, under plausible parametric assumptions, two measures of VL on infected individuals can be used to accurately estimate the VL mean function. Specifically, we consider a discrete-time likelihood-based approach to modeling and estimating partial observed longitudinal samples. We study a multivariate normal model for a function of the VL that accounts for possible correlation between measurements within individuals. We derive an expectation-maximization (EM) algorithm which treats the unknown time origins and the missing measurements as latent variables. Our main motivation is the reconstruction of the daily mean VL, given measurements on patients whose VLs were measured multiple times on different days. Such data should and can be obtained at the beginning of a pandemic with the specific goal of estimating the VL dynamics. For demonstration purposes, the method is applied to SARS-Cov-2 cycle-threshold-value data collected in Israel.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"44 5","pages":"e70033"},"PeriodicalIF":1.8,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11851093/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143493468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing maturity of genetic profiling, an essential and routine task in cancer research is to model disease outcomes/phenotypes using genetic variables. Many methods have been successfully developed. However, empirical performance is often unsatisfactory because of a "lack of information." In cancer research and clinical practice, a source of information that is broadly available and highly cost-effective comes from pathological images, which are routinely collected for definitive diagnosis and staging. In this article, we consider a Bayesian approach for selecting relevant genetic variables and modeling their relationships with a cancer outcome/phenotype. We propose borrowing information from (manually curated, low-dimensional) pathological imaging features by reinforcing the same selection results for the cancer outcome and the imaging features. We further develop a weighting strategy to accommodate the scenario where information borrowing may not be equally effective for all subjects. Computation is carefully examined. Simulations demonstrate the competitive performance of the proposed approach. We analyze TCGA (The Cancer Genome Atlas) LUAD (lung adenocarcinoma) data, with overall survival and gene expressions being the outcome and genetic variables, respectively. The analysis yields findings that differ from those of the alternatives and that have sound properties.
{"title":"Bayesian Modeling of Cancer Outcomes Using Genetic Variables Assisted by Pathological Imaging Data.","authors":"Yunju Im, Rong Li, Shuangge Ma","doi":"10.1002/sim.10350","DOIUrl":"10.1002/sim.10350","url":null,"abstract":"<p><p>With the increasing maturity of genetic profiling, an essential and routine task in cancer research is to model disease outcomes/phenotypes using genetic variables. Many methods have been successfully developed. However, oftentimes, empirical performance is unsatisfactory because of a \"lack of information.\" In cancer research and clinical practice, a source of information that is broadly available and highly cost-effective comes from pathological images, which are routinely collected for definitive diagnosis and staging. In this article, we consider a Bayesian approach for selecting relevant genetic variables and modeling their relationships with a cancer outcome/phenotype. We propose borrowing information from (manually curated, low-dimensional) pathological imaging features via reinforcing the same selection results for the cancer outcome and imaging features. We further develop a weighting strategy to accommodate the scenario where information borrowing may not be equally effective for all subjects. Computation is carefully examined. Simulations demonstrate competitive performance of the proposed approach. We analyze TCGA (The Cancer Genome Atlas) LUAD (lung adenocarcinoma) data, with overall survival and gene expressions being the outcome and genetic variables, respectively. Findings different from the alternatives and with sound properties are made.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"44 3-4","pages":"e10350"},"PeriodicalIF":1.8,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11774474/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143011847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thanthirige Lakshika M Ruberu, Danielle Braun, Giovanni Parmigiani, Swati Biswas
Multi-gene panel testing allows efficient detection of pathogenic variants in cancer susceptibility genes, including moderate-risk genes such as ATM and PALB2. A growing number of studies examine the risk of breast cancer (BC) conferred by pathogenic variants of these genes. A meta-analysis combining the reported risk estimates can provide an overall estimate of the age-specific risk of developing BC, that is, the penetrance for a gene. However, estimates reported by case-control studies often suffer from ascertainment bias. Currently, there is no method available to adjust for such bias in this setting. We consider a Bayesian random-effects meta-analysis method that can synthesize different types of risk measures and extend it to incorporate studies with ascertainment bias. This is achieved by introducing a bias term in the model and assigning appropriate priors. We validate the method through a simulation study and apply it to estimate BC penetrance for carriers of pathogenic variants in the ATM and PALB2 genes. Our simulations show that the proposed method results in more accurate and precise penetrance estimates compared to when no adjustment is made for ascertainment bias or when such biased studies are discarded from the analysis. The overall estimated BC risks for individuals with pathogenic variants are (1) 5.77% (3.22%-9.67%) by age 50 and 26.13% (20.31%-32.94%) by age 80 for ATM; (2) 12.99% (6.48%-22.23%) by age 50 and 44.69% (34.40%-55.80%) by age 80 for PALB2. The proposed method allows meta-analyses to include studies with ascertainment bias, resulting in the inclusion of more studies and thereby more accurate estimates.
{"title":"Adjusting for Ascertainment Bias in Meta-Analysis of Penetrance for Cancer Risk.","authors":"Thanthirige Lakshika M Ruberu, Danielle Braun, Giovanni Parmigiani, Swati Biswas","doi":"10.1002/sim.10323","DOIUrl":"10.1002/sim.10323","url":null,"abstract":"<p><p>Multi-gene panel testing allows efficient detection of pathogenic variants in cancer susceptibility genes including moderate-risk genes such as ATM and PALB2. A growing number of studies examine the risk of breast cancer (BC) conferred by pathogenic variants of these genes. A meta-analysis combining the reported risk estimates can provide an overall estimate of age-specific risk of developing BC, that is, penetrance for a gene. However, estimates reported by case-control studies often suffer from ascertainment bias. Currently, there is no method available to adjust for such bias in this setting. We consider a Bayesian random effect meta-analysis method that can synthesize different types of risk measures and extend it to incorporate studies with ascertainment bias. This is achieved by introducing a bias term in the model and assigning appropriate priors. We validate the method through a simulation study and apply it to estimate BC penetrance for carriers of pathogenic variants in the ATM and PALB2 genes. Our simulations show that the proposed method results in more accurate and precise penetrance estimates compared to when no adjustment is made for ascertainment bias or when such biased studies are discarded from the analysis. The overall estimated BC risk for individuals with pathogenic variants are (1) 5.77% (3.22%-9.67%) by age 50 and 26.13% (20.31%-32.94%) by age 80 for ATM; (2) 12.99% (6.48%-22.23%) by age 50, and 44.69% (34.40%-55.80%) by age 80 for PALB2. The proposed method allows meta-analyses to include studies with ascertainment bias, resulting in inclusion of more studies and thereby more accurate estimates.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"44 3-4","pages":"e10323"},"PeriodicalIF":1.8,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11881752/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143047829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This manuscript derives the allocation predictability of individual treatment assignments, measured by the correct guess probability and the probability of a deterministic assignment, as well as their averages over a randomization sequence, based on the treatment imbalance transition matrix and the conditional allocation probability. The methods described are applicable to restricted randomization designs that satisfy the following criteria: (1) two-arm equal allocation, (2) restriction of maximum tolerated imbalance, and (3) conditional allocation probability fully determined by the observed current treatment imbalance. Analytical results indicate that, for two-arm equal allocation trials, allocation predictability alternates with the odd/even position of the treatment assignment in the sequence. Additionally, the sequence-average allocation predictability converges to its asymptotic value significantly more slowly than the allocation predictability of an individual assignment does. Consequently, comparisons of allocation predictability between different randomization designs based on sequence averages are sensitive to sequence length. Using sequence-average allocation predictability may underestimate the risk of selection bias for individual assignments. This discrepancy is particularly pronounced for short sequences, where individual assignment predictability can be substantially higher than the sequence average.
{"title":"Allocation Predictability of Individual Assignments in Restricted Randomization Designs for Two-Arm Equal Allocation Trials.","authors":"Wenle Zhao, Sherry Livingston","doi":"10.1002/sim.10343","DOIUrl":"10.1002/sim.10343","url":null,"abstract":"<p><p>This manuscript derives the allocation predictability measured by the correct guess probability and the probability of being deterministic for individual treatment assignments, as well as the averages of a randomization sequence, based on the treatment imbalance transition matrix and the conditional allocation probability. The methods described are applicable to restricted randomization designs that satisfy the following criteria: (1) two-arm equal allocation, (2) restriction of maximum tolerated imbalance, and (3) conditional allocation probability fully determined by the observed current treatment imbalance. Analytical results indicate that, for two-arm equal allocation trials, allocation predictability alternates by the odd/even sequence order of the treatment assignment. Additionally, the sequence average allocation predictability converges to its asymptotic value significantly more slowly than the allocation predictability for individual assignment does. Consequently, comparisons of allocation predictability between different randomization designs based on sequence averages are sensitive to sequence length. Using sequence average allocation predictability may underestimate the risk of selection bias for individual assignment. This discrepancy is particularly pronounced for short sequence lengths, where individual assignment predictability can be substantially higher than the sequence average.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":"44 3-4","pages":"e10343"},"PeriodicalIF":1.8,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11810053/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143034090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-15. Epub Date: 2024-12-16. DOI: 10.1002/sim.10303
Yuki Itaya, Jun Tamura, Kenichi Hayashi, Kouji Yamamoto
Evaluating classifications is crucial in statistics and machine learning, as it influences decision-making across various fields, such as patient prognosis and therapy in critical conditions. The Matthews correlation coefficient (MCC), also known as the phi coefficient, is recognized as a performance metric with high reliability, offering a balanced measurement even in the presence of class imbalances. Despite its importance, there remains a notable lack of comprehensive research on the statistical inference of MCC. This deficiency often leads to studies merely validating and comparing MCC point estimates, a practice that, while common, overlooks the statistical significance and reliability of results. Addressing this research gap, our paper introduces and evaluates several methods to construct asymptotic confidence intervals for the single MCC and the differences between MCCs in paired designs. Through simulations across various scenarios, we evaluate the finite-sample behavior of these methods and compare their performances. Furthermore, through real data analysis, we illustrate the potential utility of our findings in comparing binary classifiers, highlighting the possible contributions of our research in this field.
{"title":"Asymptotic Properties of Matthews Correlation Coefficient.","authors":"Yuki Itaya, Jun Tamura, Kenichi Hayashi, Kouji Yamamoto","doi":"10.1002/sim.10303","DOIUrl":"10.1002/sim.10303","url":null,"abstract":"<p><p>Evaluating classifications is crucial in statistics and machine learning, as it influences decision-making across various fields, such as patient prognosis and therapy in critical conditions. The Matthews correlation coefficient (MCC), also known as the phi coefficient, is recognized as a performance metric with high reliability, offering a balanced measurement even in the presence of class imbalances. Despite its importance, there remains a notable lack of comprehensive research on the statistical inference of MCC. This deficiency often leads to studies merely validating and comparing MCC point estimates-a practice that, while common, overlooks the statistical significance and reliability of results. Addressing this research gap, our paper introduces and evaluates several methods to construct asymptotic confidence intervals for the single MCC and the differences between MCCs in paired designs. Through simulations across various scenarios, we evaluate the finite-sample behavior of these methods and compare their performances. Furthermore, through real data analysis, we illustrate the potential utility of our findings in comparing binary classifiers, highlighting the possible contributions of our research in this field.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":" ","pages":"e10303"},"PeriodicalIF":1.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142839901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-15. Epub Date: 2024-12-12. DOI: 10.1002/sim.10271
Zhenyu Xu, Jason P Fine, Wenling Song, Jun Yan
Generalized estimating equations (GEE) are of great importance in analyzing clustered data without full specification of multivariate distributions. A recent approach by Luo and Pan jointly models the mean, variance, and correlation coefficients of clustered data through three sets of regressions. We note that it represents a specific case of the more general estimating equations proposed by Yan and Fine, which further allow the variance to depend on the mean through a variance function. In certain scenarios, the variance estimators proposed by Luo and Pan for the variance and correlation parameters may face challenges due to the subtle dependence induced by the nested structure of the estimating equations. We characterize specific model settings where their variance estimation approach may encounter limitations and illustrate how the variance estimators in Yan and Fine can correctly account for such dependencies. In addition, we introduce a novel model selection criterion that enables the simultaneous selection of the mean-scale-correlation model. The sandwich variance estimator and the proposed model selection criterion are tested in several simulation studies and a real data analysis, which validate their effectiveness in variance estimation and model selection. Our work also extends the R package geepack with the flexibility to apply different working covariance matrices for the variance and correlation structures.
{"title":"On GEE for Mean-Variance-Correlation Models: Variance Estimation and Model Selection.","authors":"Zhenyu Xu, Jason P Fine, Wenling Song, Jun Yan","doi":"10.1002/sim.10271","DOIUrl":"10.1002/sim.10271","url":null,"abstract":"<p><p>Generalized estimating equations (GEE) are of great importance in analyzing clustered data without full specification of multivariate distributions. A recent approach by Luo and Pan jointly models the mean, variance, and correlation coefficients of clustered data through three sets of regressions. We note that it represents a specific case of the more general estimating equations proposed by Yan and Fine which further allow the variance to depend on the mean through a variance function. In certain scenarios, the proposed variance estimators for the variance and correlation parameters in Luo and Pan may face challenges due to the subtle dependence induced by the nested structure of the estimating equations. We characterize specific model settings where their variance estimation approach may encounter limitations and illustrate how the variance estimators in Yan and Fine can correctly account for such dependencies. In addition, we introduce a novel model selection criterion that enables the simultaneous selection of the mean-scale-correlation model. The sandwich variance estimator and the proposed model selection criterion are tested by several simulation studies and real data analysis, which validate its effectiveness in variance estimation and model selection. Our work also extends the R package geepack with the flexibility to apply different working covariance matrices for the variance and correlation structures.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":" ","pages":"e10271"},"PeriodicalIF":1.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142814343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-15. Epub Date: 2024-12-12. DOI: 10.1002/sim.10275
Matthieu Pluntz, Cyril Dalmasso, Pascale Tubert-Bitter, Ismaïl Ahmed
High-dimensional regression problems, for example with genomic or drug exposure data, typically involve automated selection of a sparse set of regressors. Penalized regression methods like the LASSO can deliver a family of candidate sparse models. To select one, criteria balancing log-likelihood against model size are used, most commonly AIC and BIC. These two criteria do not take into account the implicit multiple testing performed when selecting variables in a high-dimensional regression, which makes them too liberal. We propose the extended AIC (EAIC), a new information criterion for sparse model selection in high-dimensional regressions. It allows for asymptotic FWER control when the candidate regressors are independent. It is based on a simple formula involving the model log-likelihood, the model size, the total number of candidate regressors, and the FWER target. In a simulation study over a wide range of linear and logistic regression settings, we assessed the variable selection performance of the EAIC and of other information criteria (including some that also use the number of candidate regressors: mBIC, mAIC, and EBIC) in conjunction with the LASSO. Our method controls the FWER in nearly all settings, in contrast to the AIC and BIC, which produce many false positives. We also illustrate it on the automated signal detection of adverse drug reactions in the French pharmacovigilance spontaneous reporting database.
{"title":"A Simple Information Criterion for Variable Selection in High-Dimensional Regression.","authors":"Matthieu Pluntz, Cyril Dalmasso, Pascale Tubert-Bitter, Ismaïl Ahmed","doi":"10.1002/sim.10275","DOIUrl":"10.1002/sim.10275","url":null,"abstract":"<p><p>High-dimensional regression problems, for example with genomic or drug exposure data, typically involve automated selection of a sparse set of regressors. Penalized regression methods like the LASSO can deliver a family of candidate sparse models. To select one, there are criteria balancing log-likelihood and model size, the most common being AIC and BIC. These two methods do not take into account the implicit multiple testing performed when selecting variables in a high-dimensional regression, which makes them too liberal. We propose the extended AIC (EAIC), a new information criterion for sparse model selection in high-dimensional regressions. It allows for asymptotic FWER control when the candidate regressors are independent. It is based on a simple formula involving model log-likelihood, model size, the total number of candidate regressors, and the FWER target. In a simulation study over a wide range of linear and logistic regression settings, we assessed the variable selection performance of the EAIC and of other information criteria (including some that also use the number of candidate regressors: mBIC, mAIC, and EBIC) in conjunction with the LASSO. Our method controls the FWER in nearly all settings, in contrast to the AIC and BIC, which produce many false positives. We also illustrate it for the automated signal detection of adverse drug reactions on the French pharmacovigilance spontaneous reporting database.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":" ","pages":"e10275"},"PeriodicalIF":1.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142814333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-15. Epub Date: 2024-12-09. DOI: 10.1002/sim.10297
Angela Carollo, Paul Eilers, Hein Putter, Jutta Gampe
Hazard models are the most commonly used tool to analyze time-to-event data. If more than one time scale is relevant for the event under study, models are required that can incorporate the dependence of a hazard along two (or more) time scales. Such models should be flexible to capture the joint influence of several time scales, and nonparametric smoothing techniques are obvious candidates. P-splines offer a flexible way to specify such hazard surfaces, and estimation is achieved by maximizing a penalized Poisson likelihood. Standard observation schemes, such as right-censoring and left-truncation, can be accommodated in a straightforward manner. Proportional hazards regression with a baseline hazard varying over two time scales is presented. Efficient computation is possible by generalized linear array model (GLAM) algorithms or by exploiting a sparse mixed model formulation. A companion R-package is provided.
{"title":"Smooth Hazards With Multiple Time Scales.","authors":"Angela Carollo, Paul Eilers, Hein Putter, Jutta Gampe","doi":"10.1002/sim.10297","DOIUrl":"10.1002/sim.10297","url":null,"abstract":"<p><p>Hazard models are the most commonly used tool to analyze time-to-event data. If more than one time scale is relevant for the event under study, models are required that can incorporate the dependence of a hazard along two (or more) time scales. Such models should be flexible to capture the joint influence of several time scales, and nonparametric smoothing techniques are obvious candidates. <math> <semantics><mrow><mi>P</mi></mrow> <annotation>$$ P $$</annotation></semantics> </math> -splines offer a flexible way to specify such hazard surfaces, and estimation is achieved by maximizing a penalized Poisson likelihood. Standard observation schemes, such as right-censoring and left-truncation, can be accommodated in a straightforward manner. Proportional hazards regression with a baseline hazard varying over two time scales is presented. Efficient computation is possible by generalized linear array model (GLAM) algorithms or by exploiting a sparse mixed model formulation. A companion R-package is provided.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":" ","pages":"e10297"},"PeriodicalIF":1.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142795142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-15. Epub Date: 2024-12-11. DOI: 10.1002/sim.10300
Marie Analiz April Limpoco, Christel Faes, Niel Hens
In medical research, individual-level patient data provide invaluable information, but the patients' right to confidentiality remains of utmost priority. This poses a huge challenge when estimating statistical models such as a linear mixed model, which is an extension of linear regression models that can account for potential heterogeneity whenever data come from different data providers. Federated learning tackles this hurdle by estimating parameters without retrieving individual-level data. Instead, iterative communication of parameter estimate updates between the data providers and analysts is required. In this article, we propose an alternative framework to federated learning for fitting linear mixed models. Specifically, our approach only requires the mean, covariance, and sample size of multiple covariates from different data providers once. Using the principle of statistical sufficiency within the likelihood framework as theoretical support, this proposed strategy achieves estimates identical to those derived from actual individual-level data. We demonstrate this approach through real data on 15 068 patient records from 70 clinics at the Children's Hospital of Pennsylvania. Assuming that each clinic only shares summary statistics once, we model the COVID-19 polymerase chain reaction test cycle threshold as a function of patient information. Simplicity, communication efficiency, generalisability, and wider scope of implementation in any statistical software distinguish our approach from existing strategies in the literature.
{"title":"Linear Mixed Modeling of Federated Data When Only the Mean, Covariance, and Sample Size Are Available.","authors":"Marie Analiz April Limpoco, Christel Faes, Niel Hens","doi":"10.1002/sim.10300","DOIUrl":"10.1002/sim.10300","url":null,"abstract":"<p><p>In medical research, individual-level patient data provide invaluable information, but the patients' right to confidentiality remains of utmost priority. This poses a huge challenge when estimating statistical models such as a linear mixed model, which is an extension of linear regression models that can account for potential heterogeneity whenever data come from different data providers. Federated learning tackles this hurdle by estimating parameters without retrieving individual-level data. Instead, iterative communication of parameter estimate updates between the data providers and analysts is required. In this article, we propose an alternative framework to federated learning for fitting linear mixed models. Specifically, our approach only requires the mean, covariance, and sample size of multiple covariates from different data providers once. Using the principle of statistical sufficiency within the likelihood framework as theoretical support, this proposed strategy achieves estimates identical to those derived from actual individual-level data. We demonstrate this approach through real data on 15 068 patient records from 70 clinics at the Children's Hospital of Pennsylvania. Assuming that each clinic only shares summary statistics once, we model the COVID-19 polymerase chain reaction test cycle threshold as a function of patient information. Simplicity, communication efficiency, generalisability, and wider scope of implementation in any statistical software distinguish our approach from existing strategies in the literature.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":" ","pages":"e10300"},"PeriodicalIF":1.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142814337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-30. Epub Date: 2024-11-14. DOI: 10.1002/sim.10225
Changhui Yuan, Shishun Zhao, Shuwei Li, Xinyuan Song
Partially linear models provide a valuable tool for modeling failure time data with nonlinear covariate effects. Their applicability and importance in survival analysis have been widely acknowledged. To date, numerous inference methods for such models have been developed under traditional right censoring. However, the existing studies seldom target interval-censored data, which provide coarser information and frequently occur in many scientific studies involving periodic follow-up. In this work, we propose a flexible class of partially linear transformation models to examine parametric and nonparametric covariate effects for interval-censored outcomes. We consider a sieve maximum likelihood estimation approach that approximates the cumulative baseline hazard function and the nonparametric covariate effect with monotone splines and B-splines, respectively. We develop an easy-to-implement expectation-maximization algorithm coupled with three-stage data augmentation to facilitate maximization. We establish the consistency of the proposed estimators and the asymptotic distribution of the parametric components based on empirical process techniques. Numerical results from extensive simulation studies indicate that our proposed method performs satisfactorily in finite samples. An application to a study on hypobaric decompression sickness suggests that the variable TR360 exhibits a significant dynamic and nonlinear effect on the risk of developing hypobaric decompression sickness.
{"title":"Sieve Maximum Likelihood Estimation of Partially Linear Transformation Models With Interval-Censored Data.","authors":"Changhui Yuan, Shishun Zhao, Shuwei Li, Xinyuan Song","doi":"10.1002/sim.10225","DOIUrl":"10.1002/sim.10225","url":null,"abstract":"<p><p>Partially linear models provide a valuable tool for modeling failure time data with nonlinear covariate effects. Their applicability and importance in survival analysis have been widely acknowledged. To date, numerous inference methods for such models have been developed under traditional right censoring. However, the existing studies seldom target interval-censored data, which provide more coarse information and frequently occur in many scientific studies involving periodical follow-up. In this work, we propose a flexible class of partially linear transformation models to examine parametric and nonparametric covariate effects for interval-censored outcomes. We consider the sieve maximum likelihood estimation approach that approximates the cumulative baseline hazard function and nonparametric covariate effect with the monotone splines and <math> <semantics><mrow><mi>B</mi></mrow> <annotation>$$ B $$</annotation></semantics> </math> -splines, respectively. We develop an easy-to-implement expectation-maximization algorithm coupled with three-stage data augmentation to facilitate maximization. We establish the consistency of the proposed estimators and the asymptotic distribution of parametric components based on the empirical process techniques. Numerical results from extensive simulation studies indicate that our proposed method performs satisfactorily in finite samples. An application to a study on hypobaric decompression sickness suggests that the variable TR360 exhibits a significant dynamic and nonlinear effect on the risk of developing hypobaric decompression sickness.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":" ","pages":"5765-5776"},"PeriodicalIF":1.8,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142628019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}