A Note on the Required Sample Size of Model-Based Dose-Finding Methods for Molecularly Targeted Agents

Austin Biometrics and Biostatistics. Pub Date: 2021-03-03. DOI: 10.26420/AUSTINBIOMANDBIOSTAT.2021.1037

S. Hong, Ying Sun, H. Li, Lynn Hs


Citations: 1

Abstract

Random forest has proven to be a successful machine learning method, but it can be time-consuming on large datasets, especially for iterative tasks. Iterative imputation methods based on machine learning are well accepted by researchers for imputing missing data, but they can take longer than standard imputation methods. To overcome this drawback, different parallel computing strategies have been proposed, but their impact on imputation results and subsequent statistical analyses is relatively unknown. Newer random forest implementations, such as ranger and randomForestSRC, offer alternatives that are easier to parallelize, but their validity for iterative imputation is still unclear. Using the random forest imputation algorithm missForest as an example, this study examines two parallelized methods built on these newer random forest implementations and compares them with the two parallel strategies (variable-wise distributed computation and model-wise distributed computation) that use language-level parallelization from the software package. Results from the simulation experiments showed that the parallel strategies can affect both the imputation process and the final imputation results in different ways. The strategies improve computational speed to varying extents, and, based on the simulations, ranger can provide a performance boost for datasets of different sizes with reasonable accuracy. Specifically, even though different strategies can produce similar normalized root mean squared prediction errors, the variable-wise distributed strategy led to additional bias when estimating the means and inter-correlations of the covariates and their regression coefficients. Parallelization by randomForestSRC can lead to changes in both prediction errors and estimates.
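The abstract compares strategies by their normalized root mean squared prediction error. As a rough illustration of that metric, the Python sketch below computes it over the cells that were originally missing. The function name `nrmse`, the row-list inputs, and the use of the sample variance (n - 1 denominator) of the true values as the normalizer are assumptions made for this example; the abstract does not state the exact formula the authors used.

```python
import math
import statistics


def nrmse(true_rows, imputed_rows, missing_mask):
    """Normalized RMSE over originally missing cells (illustrative sketch).

    Assumed definition: sqrt(mean squared error / sample variance of the
    true values), both taken over the masked (missing) cells only.
    """
    true_vals = []
    sq_errs = []
    for t_row, i_row, m_row in zip(true_rows, imputed_rows, missing_mask):
        for t, i, m in zip(t_row, i_row, m_row):
            if m:  # only cells that were missing contribute
                true_vals.append(t)
                sq_errs.append((t - i) ** 2)
    if len(true_vals) < 2:
        raise ValueError("need at least two missing cells to normalize")
    var_true = statistics.variance(true_vals)  # sample variance (n - 1)
    if var_true == 0:
        raise ValueError("true values at missing cells have zero variance")
    return math.sqrt(statistics.fmean(sq_errs) / var_true)


# Hypothetical toy data: the second column was missing in both rows.
truth = [[1.0, 2.0], [3.0, 4.0]]
imputed = [[1.0, 2.5], [3.0, 3.5]]
mask = [[False, True], [False, True]]
score = nrmse(truth, imputed, mask)
```

A value of 0 means the imputed cells match the truth exactly. As the abstract notes, two strategies can reach similar values on this metric while still differing in the bias of downstream means, correlations, and regression coefficients, so the metric alone does not establish that two strategies are equivalent.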


