
Biometrical Journal: Latest Publications

Sample Size Calculation for an Individual Stepped-Wedge Randomized Trial
IF 1.3 · CAS Tier 3 (Biology) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-07-11 · DOI: 10.1002/bimj.202300167
Aude Allemang-Trivalle, Annabel Maruani, Bruno Giraudeau

In the individual stepped-wedge randomized trial (ISW-RT), subjects are allocated to sequences, each sequence consisting of a control period followed by an experimental period. The total follow-up time is the same for all sequences, but the duration of the control and experimental periods varies among sequences. To our knowledge, unlike for stepped-wedge cluster randomized trials (SW-CRTs), there is no validated sample size calculation formula for ISW-RTs. The objective of this study was to adapt the formula used for SW-CRTs to the case of individual randomization and to validate this adaptation in a Monte Carlo simulation study. The proposed sample size calculation formula for the ISW-RT design yielded satisfactory empirical power for most scenarios, except those with operating characteristic values near the boundary (i.e., the smallest possible number of periods, or a very high or very low autocorrelation coefficient). Overall, the results provide useful insights into sample size calculation for ISW-RTs.
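The abstract does not reproduce the adapted formula itself, but the general shape of such a calculation can be sketched: a classical two-sample normal-approximation sample size, inflated by a design effect standing in for the repeated-measures correction an ISW-RT would require. The function name and the `design_effect` parameter are illustrative, not the paper's notation; a minimal stdlib sketch:

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma=1.0, alpha=0.05, power=0.80, design_effect=1.0):
    # Classical two-sample normal-approximation sample size per group;
    # `design_effect` is a hypothetical inflation factor standing in for
    # the correction an ISW-RT design would require (not the paper's formula).
    z = NormalDist().inv_cdf
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sigma / delta) ** 2
    return math.ceil(n * design_effect)


print(n_per_group(0.5))  # 63 subjects per group for a standardized effect of 0.5
```

The validated formula in the paper would replace the crude `design_effect` with terms depending on the number of periods and the autocorrelation coefficient.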

Citations: 0
Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control
IF 1.3 · CAS Tier 3 (Biology) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-07-11 · DOI: 10.1002/bimj.202300278
Raphael O. Betschart, Cristian Riccio, Domingo Aguilera-Garcia, Stefan Blankenberg, Linlin Guo, Holger Moch, Dagmar Seidl, Hugo Solleder, Felix Thalén, Alexandre Thiéry, Raphael Twerenbold, Tanja Zeller, Martin Zoche, Andreas Ziegler

Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analyses between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not yet worked with WGS data, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics applied to WGS data: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with data from the GENEtic SequencIng Study Hamburg–Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time scaled linearly with genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
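One of the QC metrics named above, the Het/Hom ratio, is simple enough to sketch. The snippet below counts heterozygous versus homozygous-alternate calls from plain biallelic GT strings; it is a deliberately simplified illustration (no VCF parsing, no multiallelic sites), and the function name is ours, not from the paper or any pipeline:

```python
def het_hom_ratio(genotypes):
    # Count heterozygous vs. homozygous-alternate calls from simple
    # biallelic GT strings (a simplified sketch, not a full VCF parser).
    het = sum(g in ("0/1", "1/0", "0|1", "1|0") for g in genotypes)
    hom_alt = sum(g in ("1/1", "1|1") for g in genotypes)
    return het / hom_alt


calls = ["0/1", "1/1", "0|1", "0/1", "1|1"]
print(het_hom_ratio(calls))  # 1.5
```

In a real pipeline this ratio would be computed per sample over millions of variants, and samples deviating strongly from the expected value would be flagged.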

Citations: 0
Functional Multivariable Logistic Regression With an Application to HIV Viral Suppression Prediction
IF 1.3 · CAS Tier 3 (Biology) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-07-05 · DOI: 10.1002/bimj.202300081
Siyuan Guo, Jiajia Zhang, Yichao Wu, Alexander C. McLain, James W. Hardin, Bankole Olatosi, Xiaoming Li

Motivated by improving the prediction of human immunodeficiency virus (HIV) suppression status using electronic health records (EHR) data, we propose a functional multivariable logistic regression model that accounts for longitudinal binary and continuous processes simultaneously. Specifically, the longitudinal measurements of both binary and continuous variables are modeled by functional principal components analysis, and the corresponding functional principal component scores are used to build a logistic regression model for prediction. The longitudinal binary data are linked to underlying Gaussian processes. Estimation is carried out using penalized splines for the longitudinal continuous and binary data. Group lasso is used to select longitudinal processes, and a multivariate functional principal components analysis is proposed to revise the functional principal component scores to account for correlation among processes. The method is evaluated via comprehensive simulation studies and then applied to predict viral suppression using EHR data for people living with HIV in South Carolina.
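The final prediction step described here, a logistic model on functional principal component scores, reduces to the ordinary logistic link once the scores are in hand. A minimal sketch of that last step, with hypothetical scores and coefficients standing in for the FPCA output and the fitted model (the FPCA itself is not shown):

```python
import math


def predict_prob(scores, coefs, intercept=0.0):
    # Logistic prediction from functional principal component scores;
    # the scores and coefficients here are hypothetical placeholders for
    # the FPCA and model-fitting steps described in the abstract.
    eta = intercept + sum(s * c for s, c in zip(scores, coefs))
    return 1.0 / (1.0 + math.exp(-eta))


print(predict_prob([0.0, 0.0], [0.8, -0.3]))  # 0.5 at the linear-predictor origin
```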

Citations: 0
Combining Partial True Discovery Guarantee Procedures
IF 1.3 · CAS Tier 3 (Biology) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-07-02 · DOI: 10.1002/bimj.202300075
Ningning Xu, Aldo Solari, Jelle J. Goeman

Closed testing has recently been shown to be optimal for simultaneous true discovery proportion control. It is, however, challenging to construct true discovery guarantee procedures in a way that focuses power on feature sets chosen by users based on their specific interest or expertise. We propose a procedure that allows users to target power at prespecified feature sets, that is, “focus sets.” The method also allows inference for feature sets chosen post hoc, that is, “nonfocus sets,” for which we deduce a true discovery lower confidence bound by interpolation. Our procedure is built from partial true discovery guarantee procedures combined with Holm's procedure and is a conservative shortcut to the closed testing procedure. A simulation study confirms that the statistical power of our method is relatively high for focus sets, at the cost of power for nonfocus sets, as desired. In addition, we investigate its power properties for sets with specific structures, for example, trees and directed acyclic graphs. We also compare our method with AdaFilter in the context of replicability analysis. The application of our method is illustrated with a gene ontology analysis of gene expression data.
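One named building block of the proposed procedure, Holm's step-down method, is standard and compact enough to sketch (this is only the Holm component, not the closed-testing or interpolation machinery the paper adds on top):

```python
def holm_reject(pvals, alpha=0.05):
    # Holm's step-down procedure: compare the k-th smallest p-value to
    # alpha/(m-k) and stop at the first comparison that fails.
    m = len(pvals)
    order = sorted(range(m), key=pvals.__getitem__)
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject


print(holm_reject([0.010, 0.040, 0.030, 0.005]))  # [True, False, False, True]
```

Here 0.005 passes 0.05/4 and 0.010 passes 0.05/3, but 0.030 fails 0.05/2, so the step-down stops and the two larger p-values are not rejected.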

Citations: 0
Simultaneous Inference of Multiple Binary Endpoints in Biomedical Research: Small Sample Properties of Multiple Marginal Models and a Resampling Approach
IF 1.3 · CAS Tier 3 (Biology) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-07-02 · DOI: 10.1002/bimj.202300197
Sören Budig, Klaus Jung, Mario Hasler, Frank Schaarschmidt

In biomedical research, the simultaneous inference of multiple binary endpoints may be of interest. In such cases, an appropriate multiplicity adjustment is required that controls the family-wise error rate, which represents the probability of making incorrect test decisions. In this paper, we investigate two approaches that perform single-step p-value adjustments that also take into account the possible correlation between endpoints. A rather novel and flexible approach known as multiple marginal models is considered, which is based on stacking of the parameter estimates of the marginal models and deriving their joint asymptotic distribution. We also investigate a nonparametric vector-based resampling approach, and we compare both approaches with the Bonferroni method by examining the family-wise error rate and power for different parameter settings, including low proportions and small sample sizes. The results show that the resampling-based approach consistently outperforms the other methods in terms of power, while still controlling the family-wise error rate. The multiple marginal models approach, on the other hand, shows a more conservative behavior. However, it offers more versatility in application, allowing for more complex models or straightforward computation of simultaneous confidence intervals. The practical application of the methods is demonstrated using a toxicological dataset from the National Toxicology Program.
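The Bonferroni comparator mentioned above is the correlation-agnostic baseline: it multiplies each p-value by the number of endpoints, which is exactly why correlation-aware methods can gain power over it. A one-line sketch:

```python
def bonferroni_adjust(pvals):
    # Single-step Bonferroni adjustment: multiply each p-value by the number
    # of endpoints m and cap at 1. Ignores any correlation between endpoints,
    # which is what the two correlation-aware approaches improve upon.
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


print(bonferroni_adjust([0.5, 0.9]))  # [1.0, 1.0]
```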

Citations: 0
Penalized Regression Methods With Modified Cross-Validation and Bootstrap Tuning Produce Better Prediction Models
IF 1.3 · CAS Tier 3 (Biology) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-06-24 · DOI: 10.1002/bimj.202300245
Menelaos Pavlou, Rumana Z. Omar, Gareth Ambler

Risk prediction models fitted using maximum likelihood estimation (MLE) are often overfitted, resulting in predictions that are too extreme and a calibration slope (CS) less than 1. Penalized methods, such as Ridge and Lasso, have been suggested as a solution to this problem as they tend to shrink regression coefficients toward zero, resulting in predictions closer to the average. The amount of shrinkage is regulated by a tuning parameter, λ, commonly selected via cross-validation (“standard tuning”). Though penalized methods have been found to improve calibration on average, they often over-shrink and exhibit large variability in the selected λ and hence the CS. This is a problem, particularly for small sample sizes, but also when using sample sizes recommended to control overfitting. We consider whether these problems are partly due to selecting λ using cross-validation with “training” datasets of reduced size compared to the original development sample, resulting in an over-estimation of λ and, hence, excessive shrinkage. We propose a modified cross-validation tuning method (“modified tuning”), which estimates λ from a pseudo-development dataset obtained via bootstrapping from the original dataset, albeit of larger size, such that the resulting cross-validation training datasets are of the same size as the original dataset. Modified tuning can be easily implemented in standard software and is closely related to bootstrap selection of the tuning parameter (“bootstrap tuning”). We evaluated modified and bootstrap tuning for Ridge and Lasso in simulated and real data using recommended sample sizes, and sizes slightly lower and higher. Both substantially improved the selection of λ, resulting in improved CS compared to the standard tuning method. They also improved predictions compared to MLE.
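The core of "modified tuning" as described is a sizing argument: bootstrap a pseudo-development set large enough that each K-fold training split matches the original sample size n, i.e., m(K-1)/K ≥ n. A stdlib sketch of that construction (function names are ours, not the paper's):

```python
import math
import random


def pseudo_dev_size(n, k):
    # Choose m so that a (k-1)/k training split of the pseudo-development
    # set has (at least) the original sample size n: m * (k-1)/k >= n.
    return math.ceil(n * k / (k - 1))


def make_pseudo_dev(data, k, seed=0):
    # Bootstrap (sample with replacement) a pseudo-development set of that
    # size from the original development data.
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(pseudo_dev_size(len(data), k))]


print(pseudo_dev_size(100, 10))  # 112: 10-fold training folds then hold ~100 rows
```

Cross-validating λ on this enlarged bootstrap sample, rather than on the original data, is what removes the systematic over-estimation of λ the abstract describes.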

Citations: 0
Issue Information: Biometrical Journal 5'24
IF 1.3 · CAS Tier 3 (Biology) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-06-21 · DOI: 10.1002/bimj.202470005
Citations: 0
Causal inference in the absence of positivity: The role of overlap weights
IF 1.7 · CAS Tier 3 (Biology) · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-06-07 · DOI: 10.1002/bimj.202300156
Roland A. Matsouaka, Yunji Zhou

How should one analyze data when the positivity assumption is violated? Several possible solutions exist in the literature. In this paper, we consider propensity score (PS) methods, commonly used in observational studies to assess causal treatment effects, in the context where the positivity assumption is violated. We focus on and examine four specific alternatives to inverse probability weighting (IPW) with trimming and truncation: the matching weight (MW), Shannon's entropy weight (EW), overlap weight (OW), and beta weight (BW) estimators.

We first specify their target population: the population of patients in clinical equipoise, that is, those for whom there is sufficient PS overlap. Then, we establish the nexus among the different corresponding weights (and estimators); this allows us to highlight the shared properties and theoretical implications of these estimators. Finally, we introduce their augmented estimators, which take advantage of estimating both the propensity score and outcome regression models to enhance the treatment effect estimators in terms of bias and efficiency. We also elucidate the role of the OW estimator as the flagship of all these methods that target the overlap population.

Our analytic results demonstrate that OW, MW, and EW are preferable to IPW and some cases of BW when there is a moderate or extreme (stochastic or structural) violation of the positivity assumption. We then evaluate, compare, and confirm the finite-sample performance of the aforementioned estimators via Monte Carlo simulations. Finally, we illustrate these methods using two real-world data examples marked by violations of the positivity assumption.
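In the balancing-weights literature, these estimators share a common template: weight h(e)/e for treated units and h(e)/(1-e) for controls, where e is the propensity score and h is a "tilting" function (h = 1 for IPW, h = e(1-e) for OW, h = min(e, 1-e) for MW; EW and BW fit the same template with their own tilting functions, omitted here). A minimal sketch showing why OW remains bounded where IPW explodes near a positivity violation:

```python
def balancing_weight(ps, treated, kind="OW"):
    # Balancing weights share the form h(e)/e (treated) or h(e)/(1-e)
    # (control), where e is the propensity score and h a tilting function:
    #   IPW: h(e) = 1;  OW: h(e) = e*(1-e);  MW: h(e) = min(e, 1-e).
    # (EW and BW use their own tilting functions, not sketched here.)
    h = {"IPW": 1.0, "OW": ps * (1 - ps), "MW": min(ps, 1 - ps)}[kind]
    return h / ps if treated else h / (1 - ps)


# Near-violation of positivity (e = 0.95): the IPW control weight blows up
# toward 1/0.05, while the OW control weight stays at e itself.
print(balancing_weight(0.95, treated=False, kind="IPW"))  # ~20
print(balancing_weight(0.95, treated=False, kind="OW"))   # ~0.95
```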

Citations: 0
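The weighting schemes compared above differ only in their tilting function h(e) of the propensity score e: h(e) = 1 for IPW, e(1 − e) for OW, min(e, 1 − e) for MW, and the Shannon entropy of e for EW; the weight is h(e)/e for treated units and h(e)/(1 − e) for controls. A minimal numpy sketch of these weights and the corresponding Hajek-type effect estimate (function names and the toy values are mine, not from the paper):

```python
import numpy as np

def balancing_weights(ps, z, kind="OW"):
    """PS-based weights targeting the overlap population.

    ps : propensity scores in (0, 1);  z : 0/1 treatment indicator.
    Each weight is h(e)/e for treated units and h(e)/(1-e) for controls,
    where h is the tilting function of the chosen scheme.
    """
    ps = np.asarray(ps, dtype=float)
    if kind == "IPW":      # h(e) = 1 -> targets the combined population
        h = np.ones_like(ps)
    elif kind == "OW":     # h(e) = e(1 - e) -> overlap weights
        h = ps * (1 - ps)
    elif kind == "MW":     # h(e) = min(e, 1 - e) -> matching weights
        h = np.minimum(ps, 1 - ps)
    elif kind == "EW":     # h(e) = -[e log e + (1-e) log(1-e)] -> entropy weights
        h = -(ps * np.log(ps) + (1 - ps) * np.log1p(-ps))
    else:
        raise ValueError(f"unknown weighting scheme: {kind}")
    return np.where(z == 1, h / ps, h / (1 - ps))

def hajek_effect(y, z, w):
    """Weighted (Hajek) difference in means between treated and controls."""
    y, z, w = (np.asarray(a, dtype=float) for a in (y, z, w))
    treated = np.sum(w * y * z) / np.sum(w * z)
    control = np.sum(w * y * (1 - z)) / np.sum(w * (1 - z))
    return treated - control
```

Note that as the propensity score approaches 0 or 1, the IPW weight diverges, while the OW, MW, and EW weights stay bounded — which is the analytic reason these estimators remain usable under positivity violations.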
Adaptive predictor-set linear model: An imputation-free method for linear regression prediction on data sets with missing values 自适应预测集线性模型:对有缺失值的数据集进行线性回归预测的免估算方法。
IF 1.7 3区 生物学 Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2024-05-30 DOI: 10.1002/bimj.202300090
Benjamin Planterose Jiménez, Manfred Kayser, Athina Vidaki, Amke Caliebe

Linear regression (LR) is widely used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in health sciences. For parameter estimation, this shortcoming is usually resolved by complete-case analysis or imputation. Both workarounds, however, are inadequate for prediction, since they either fail to predict on incomplete records or ignore missingness-induced reduction in prediction accuracy and rely on (unrealistic) assumptions about the missing mechanism. Here, we derive the adaptive predictor-set linear model (aps-lm), capable of making predictions for incomplete data without the need for imputation. It is derived by using a predictor-selection operation, the Moore–Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is an LR generalization that inherently handles missing values. It is applied on a reference data set, where complete predictors and outcome are available, and yields a set of privacy-preserving parameters. In a second stage, these are shared for making predictions of the outcome on external data sets with missing predictor entries, without imputation. Moreover, aps-lm computes prediction errors that account for the pattern of missing values even under extreme missingness. We benchmark aps-lm in a simulation study. aps-lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios including variation of sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof-of-principle, we apply aps-lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data with promising clinical applications.

Citations: 0
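The core idea — predicting from whichever predictors a record actually has, by refitting the reduced design matrix with the Moore–Penrose pseudoinverse rather than imputing — can be sketched as follows. This is a simplified illustration under my own naming, not the authors' full aps-lm estimator, which additionally yields shareable privacy-preserving parameters and pattern-specific prediction errors:

```python
import numpy as np

def fit_pattern_coefs(X, y, observed_cols):
    """OLS coefficients for the submodel that uses only the observed predictors,
    via the Moore-Penrose pseudoinverse of the reduced design matrix."""
    X_sub = X[:, observed_cols]
    return np.linalg.pinv(X_sub) @ y

def predict_incomplete(X_ref, y_ref, x_new):
    """Predict the outcome for a record x_new that may contain NaNs,
    without imputation: restrict the reference design matrix to the
    columns observed in x_new and refit on that predictor subset."""
    observed = np.flatnonzero(~np.isnan(x_new))
    beta_sub = fit_pattern_coefs(X_ref, y_ref, observed)
    return x_new[observed] @ beta_sub
```

When the record is complete, this reduces to an ordinary LR prediction; when predictors are missing, the prediction comes from the least-squares projection onto the observed predictor subset, so no values are ever filled in.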
A Bayesian hierarchical hidden Markov model for clustering and gene selection: Application to kidney cancer gene expression data 用于聚类和基因选择的贝叶斯分层隐马尔可夫模型:应用于肾癌基因表达数据。
IF 1.7 3区 生物学 Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2024-05-30 DOI: 10.1002/bimj.202300173
Thierry Chekouo, Himadri Mukherjee

We introduce a Bayesian approach for biclustering that accounts for the prior functional dependence between genes using hidden Markov models (HMMs). We utilize biological knowledge gathered from gene ontologies and the hidden Markov structure to capture the potential coexpression of neighboring genes. Our interpretable model-based clustering characterizes each cluster of samples by three groups of features: overexpressed, underexpressed, and irrelevant features. The proposed methods have been implemented in an R package and are used to analyze both the simulated data and The Cancer Genome Atlas kidney cancer data.

Citations: 0
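The hidden Markov structure over ordered neighboring genes can be illustrated with a small three-state (underexpressed / irrelevant / overexpressed) HMM with Gaussian emissions, decoded by the Viterbi algorithm. This is a generic frequentist sketch of the Markov idea only — not the authors' Bayesian hierarchical model — and the parameter values are purely illustrative:

```python
import numpy as np

def viterbi_gaussian(obs, means, sd, trans, init):
    """Most likely state path for an HMM with Gaussian emissions (log-space Viterbi)."""
    means = np.asarray(means, dtype=float)
    obs = np.asarray(obs, dtype=float)
    K, T = len(means), len(obs)
    # Emission log-densities up to an additive constant (sd shared across states).
    log_e = -0.5 * ((obs[None, :] - means[:, None]) / sd) ** 2
    log_a = np.log(trans)
    delta = np.log(init) + log_e[:, 0]       # best log-score of paths ending in each state
    back = np.zeros((T, K), dtype=int)       # argmax pointers for backtracking
    for t in range(1, T):
        scores = delta[:, None] + log_a      # scores[i, j]: come from state i, land in j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_e[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# States along the ordered genes: 0 = underexpressed, 1 = irrelevant, 2 = overexpressed.
MEANS = [-2.0, 0.0, 2.0]
TRANS = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05],
                  [0.05, 0.05, 0.90]])  # sticky transitions encode neighboring coexpression
INIT = np.full(3, 1 / 3)
```

The sticky diagonal of the transition matrix is what makes adjacent genes likely to share a label, which is the Markov analogue of the coexpression of neighboring genes that the abstract describes.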