Estimation for monotone sampling: competitiveness and customization

E. Cohen
{"title":"Estimation for monotone sampling: competitiveness and customization","authors":"E. Cohen","doi":"10.1145/2611462.2611485","DOIUrl":null,"url":null,"abstract":"Random samples are lossy summaries which allow queries posed over the data to be approximated by applying an appropriate estimator to the sample. The effectiveness of sampling, however, hinges on estimator selection. The choice of estimators is subjected to global requirements, such as unbiasedness and range restrictions on the estimate value, and ideally, we seek estimators that are both efficient to derive and apply and admissible (not dominated, in terms of variance, by other estimators). Nevertheless, for a given data domain, sampling scheme, and query, there are many admissible estimators. We define monotone sampling, which is implicit in many applications of massive data set analysis, and study the choice of admissible nonnegative and unbiased estimators. Our main contribution is general derivations of admissible estimators with desirable properties. We present a construction of order-optimal estimators, which minimize variance according to {\\em any} specified priorities over the data domain. Order-optimality allows us to customize the derivation to common patterns that we can learn or observe in the data. When we prioritize lower values (e.g., more similar data sets when estimating difference), we obtain the L* estimator, which is the unique monotone admissible estimator and dominates the classic Horvitz-Thompson estimator. We show that the L* estimator is 4-competitive, meaning that the expectation of the square, on any data, is at most $4$ times the minimum possible for that data. These properties make the L* estimator a natural default choice. We also present the U$^*$ estimator, which prioritizes large values (e.g., less similar data sets). Our estimator constructions are general, natural, and practical, allowing us to make the most from our summarized data.","PeriodicalId":186800,"journal":{"name":"Proceedings of the 2014 ACM symposium on Principles of distributed computing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2014 ACM symposium on Principles of distributed computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2611462.2611485","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Random samples are lossy summaries that allow queries posed over the data to be approximated by applying an appropriate estimator to the sample. The effectiveness of sampling, however, hinges on estimator selection. The choice of estimators is subject to global requirements, such as unbiasedness and range restrictions on the estimate value; ideally, we seek estimators that are both efficient to derive and apply, and admissible (not dominated, in terms of variance, by other estimators). Nevertheless, for a given data domain, sampling scheme, and query, there are many admissible estimators. We define monotone sampling, which is implicit in many applications of massive data set analysis, and study the choice of admissible nonnegative unbiased estimators. Our main contributions are general derivations of admissible estimators with desirable properties. We present a construction of order-optimal estimators, which minimize variance according to any specified priorities over the data domain. Order-optimality allows us to customize the derivation to common patterns that we can learn or observe in the data. When we prioritize lower values (e.g., more similar data sets when estimating difference), we obtain the L* estimator, which is the unique monotone admissible estimator and dominates the classic Horvitz-Thompson estimator. We show that the L* estimator is 4-competitive, meaning that the expectation of its square, on any data, is at most 4 times the minimum possible for that data. These properties make the L* estimator a natural default choice. We also present the U* estimator, which prioritizes large values (e.g., less similar data sets). Our estimator constructions are general, natural, and practical, allowing us to make the most of our summarized data.
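The constructions of the order-optimal, L*, and U* estimators are developed in the paper itself. As a minimal, self-contained sketch of the setting only, the Python snippet below implements one common monotone sampling scheme, probability-proportional-to-size (PPS) threshold sampling driven by a random seed, together with the classic Horvitz-Thompson estimator that the L* estimator is shown to dominate. The function names and the threshold parameter tau are illustrative choices, not notation from the paper.

```python
import random

def pps_sample(v, tau, u=None):
    """Sample a nonnegative value v with inclusion probability min(1, v/tau).

    The scheme is monotone in the seed u in (0,1): if v is revealed at seed
    u, it stays revealed at every smaller seed, so lowering u never loses
    information. Returns v when sampled and None otherwise; the estimator
    sees only this outcome.
    """
    if u is None:
        u = random.random()
    return v if u < min(1.0, v / tau) else None

def horvitz_thompson(outcome, tau):
    """Classic Horvitz-Thompson estimate: inverse-probability weighting.

    Nonnegative and unbiased, but (per the paper) dominated in variance
    by the L* estimator.
    """
    if outcome is None:
        return 0.0
    return outcome / min(1.0, outcome / tau)

# Monte Carlo check of unbiasedness: the mean estimate approaches v.
v, tau, n = 3.0, 10.0, 200_000
estimates = [horvitz_thompson(pps_sample(v, tau), tau) for _ in range(n)]
mean = sum(estimates) / n
second_moment = sum(e * e for e in estimates) / n
print(f"true value {v}, mean HT estimate {mean:.3f}")  # ~3.0
print(f"E[estimate^2] = {second_moment:.1f}")          # cost paid in variance
```

In this baseline, for v <= tau the HT estimate equals tau whenever v is sampled, so E[estimate^2] = v*tau. The 4-competitiveness result can be read against this quantity: the L* estimator keeps the expected square within a factor of 4 of the best achievable at every value, a uniform guarantee the HT baseline does not provide.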