Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis With Limited Computational Resources

IF 16.4 · Zone 1 (Chemistry) · Q1 · CHEMISTRY, MULTIDISCIPLINARY · Accounts of Chemical Research · Pub Date: 2023-04-13 · DOI: 10.5705/ss.202021.0257

Shuyuan Wu, Xuening Zhu, Hansheng Wang


Citations: 3

Abstract

Modern statistical analysis often involves very large datasets. Conventional estimation methods can hardly be applied to such datasets directly, because practitioners often have limited computational resources; in most cases they do not have access to powerful computing platforms (e.g., Hadoop or Spark). How to analyze large datasets in practice with limited computational resources therefore becomes a problem of great importance. To solve this problem, we propose a novel subsampling-based method with jackknifing. The key idea is to treat the whole sample as if it were the population. Multiple subsamples of greatly reduced size are then obtained by simple random sampling with replacement. Notably, we do not recommend sampling without replacement, because it would incur a significant cost for data processing on the hard drive; no such cost arises if the data are processed in memory. Because the subsamples are relatively small, each can be comfortably read into computer memory as a whole and processed easily. Based on each subsample, a jackknife-debiased estimator of the target parameter is computed. The resulting estimators are statistically consistent, with an extremely small bias. Finally, the jackknife-debiased estimators from the different subsamples are averaged to form the final estimator. We show theoretically that the final estimator is consistent and asymptotically normal, and that its asymptotic statistical efficiency can be as good as that of the whole-sample estimator under very mild conditions. The proposed method is simple enough to be implemented on most practical computer systems and thus should have very wide applicability.
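
The procedure outlined in the abstract (draw small subsamples with replacement, debias each subsample estimate by jackknifing, then average) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the textbook leave-one-out jackknife bias correction, holds the data in an in-memory NumPy array rather than reading it from disk, and uses hypothetical names (`jackknife_debiased`, `subsample_jackknife`, `n_sub`, `k_sub`) chosen only for illustration.

```python
import numpy as np

def jackknife_debiased(sample, estimator):
    """Leave-one-out jackknife bias correction (textbook form, assumed here)."""
    n = len(sample)
    theta_full = estimator(sample)
    # Re-estimate with each observation left out in turn.
    theta_loo = np.array(
        [estimator(np.delete(sample, i, axis=0)) for i in range(n)]
    )
    # Classical jackknife-debiased estimate: n*theta - (n-1)*mean(leave-one-out).
    return n * theta_full - (n - 1) * theta_loo.mean()

def subsample_jackknife(data, estimator, n_sub, k_sub, rng):
    """Average jackknife-debiased estimates over k_sub subsamples of size n_sub."""
    n_total = len(data)
    estimates = []
    for _ in range(k_sub):
        # Simple random sampling WITH replacement from the whole sample,
        # which is treated as if it were the population.
        idx = rng.integers(0, n_total, size=n_sub)
        estimates.append(jackknife_debiased(data[idx], estimator))
    return float(np.mean(estimates))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.0, scale=2.0, size=100_000)
    # np.var with its default ddof=0 is a biased plug-in variance estimator.
    est = subsample_jackknife(data, np.var, n_sub=200, k_sub=50, rng=rng)
    print("subsample-jackknife variance estimate:", est)
    print("whole-sample plug-in variance:", np.var(data))
```

The plug-in variance is used as the example estimator because its bias is of order 1/n, which is noticeable at small subsample sizes and is removed by the jackknife step. In a setting where the data do not fit in memory, the indexing step `data[idx]` would be replaced by reading the selected records from disk.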

Source journal: Accounts of Chemical Research (Chemistry, Multidisciplinary)
CiteScore: 31.40
Self-citation rate: 1.10%
Articles published: 312
Review time: 2 months

Journal description: Accounts of Chemical Research presents short, concise and critical articles offering easy-to-read overviews of basic research and applications in all areas of chemistry and biochemistry. These short reviews focus on research from the author’s own laboratory and are designed to teach the reader about a research project. In addition, Accounts of Chemical Research publishes commentaries that give an informed opinion on a current research problem. Special Issues online are devoted to a single topic of unusual activity and significance. Accounts of Chemical Research replaces the traditional article abstract with an article "Conspectus." These entries synopsize the research, affording the reader a closer look at the content and significance of an article. By providing a more detailed description of the article contents, the Conspectus enhances the article's discoverability by search engines and the exposure for the research.
