基于密度估计的方法确定大数据随机抽样分区的样本量

IF 3.4 3区 计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Frontiers of Computer Science Pub Date : 2023-12-16 DOI:10.1007/s11704-023-2356-x
{"title":"基于密度估计的方法确定大数据随机抽样分区的样本量","authors":"","doi":"10.1007/s11704-023-2356-x","DOIUrl":null,"url":null,"abstract":"<h3>Abstract</h3> <p>Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (<em>p.d.f.</em>); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of <em>p.d.f.</em> estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":null,"pages":null},"PeriodicalIF":3.4000,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Density estimation-based method to determine sample size for random sample partition of big data\",\"authors\":\"\",\"doi\":\"10.1007/s11704-023-2356-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<h3>Abstract</h3> <p>Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (<em>p.d.f.</em>); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of <em>p.d.f.</em> estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.</p>\",\"PeriodicalId\":12640,\"journal\":{\"name\":\"Frontiers of Computer Science\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2023-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers of Computer Science\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11704-023-2356-x\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers of Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11704-023-2356-x","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

摘要

摘要 随机抽样分区(RSP)是一种新开发的大数据表示和管理模型,用于处理大数据近似计算问题。学术研究和实际应用证实,RSP 是一种高效的大数据处理和分析解决方案。然而,实施 RSP 的一个挑战是确定 RSP 数据块的适当样本大小。样本量大会增加大数据计算的负担,而样本量小又会导致 RSP 数据块的分布信息不足。为解决这一问题,本文提出了一种新颖的基于密度估计的方法(DEM)来确定 RSP 数据块的最佳样本量。首先,根据多元 Dvoretzky-Kiefer-Wolfowitz (DKW) 不等式,使用定点迭代 (FPI) 方法计算出理论样本量。其次,通过最小化在 RSP 数据块上构建的核密度估算器 (KDE) 的验证误差来确定实际样本量,以增加样本量。最后,通过一系列有说服力的实验来验证 DEM 的可行性、合理性和有效性。实验结果表明:(1) 从多元 DKW 不等式计算理论样本量时,FPI 方法的迭代函数是收敛的;(2) 在 RSP 数据块上构建的 KDE,其样本量由 DEM 确定,可以很好地近似概率密度函数(p.d.f.);(3) 从 p.d.f. 估计的角度来看,DEM 比现有的样本量确定方法提供了更精确的样本量。这表明,DEM 是处理大数据 RSP 实施中样本量确定问题的一种可行方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Density estimation-based method to determine sample size for random sample partition of big data

Abstract

Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of p.d.f. estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Frontiers of Computer Science
Frontiers of Computer Science COMPUTER SCIENCE, INFORMATION SYSTEMS-COMPUTER SCIENCE, SOFTWARE ENGINEERING
CiteScore
8.60
自引率
2.40%
发文量
799
审稿时长
6-12 weeks
期刊介绍: Frontiers of Computer Science aims to provide a forum for the publication of peer-reviewed papers to promote rapid communication and exchange between computer scientists. The journal publishes research papers and review articles in a wide range of topics, including: architecture, software, artificial intelligence, theoretical computer science, networks and communication, information systems, multimedia and graphics, information security, interdisciplinary, etc. The journal especially encourages papers from new emerging and multidisciplinary areas, as well as papers reflecting the international trends of research and development and on special topics reporting progress made by Chinese computer scientists.
期刊最新文献
A comprehensive survey of federated transfer learning: challenges, methods and applications DMFVAE: miRNA-disease associations prediction based on deep matrix factorization method with variational autoencoder Graph foundation model ABLkit: a Python toolkit for abductive learning SEOE: an option graph based semantically embedding method for prenatal depression detection
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1