MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

Leonardo Pellegrina, Cyrus Cousins, Fabio Vandin, Matteo Riondato
{"title":"MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining","authors":"Leonardo Pellegrina, Cyrus Cousins, Fabio Vandin, Matteo Riondato","doi":"10.1145/3532187","DOIUrl":null,"url":null,"abstract":"“I’m an MC still as honest” – Eminem, Rap God We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both (1) statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility offered by MCRapper is a big advantage over previously proposed solutions, which could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining, by appropriately computing approximations of the negative and positive borders of the collection of patterns of interest, which allow an effective pruning of the pattern space and the computation of strong bounds to the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Knowledge Discovery from Data (TKDD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3532187","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

“I’m an MC still as honest” – Eminem, Rap God We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both (1) statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility offered by MCRapper is a big advantage over previously proposed solutions, which could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining, by appropriately computing approximations of the negative and positive borders of the collection of patterns of interest, which allow an effective pruning of the pattern space and the computation of strong bounds to the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Poset族的蒙特卡洛Rademacher平均和近似模式挖掘
我们提出了MCRapper,一种有效计算蒙特卡罗经验Rademacher平均(MCERA)的算法,用于显示偏序集(例如,晶格)结构的函数族,例如在许多模式挖掘任务中出现的函数族。MCERA允许我们计算样本均值与其期望的最大偏差的上界,因此它可以用来找到(1)当可用数据被视为来自未知分布的样本时,具有统计显著性的函数(即模式),以及(2)当可用数据是来自大数据集的小样本时,高期望函数集合的近似值(例如,频繁模式)。与以前提出的解决方案相比,MCRapper提供的这种灵活性是一个很大的优势,以前的解决方案只能实现两者中的一个。MCRapper利用函数差异的上界来有效地探索和修剪搜索空间,这是一种借鉴于模式挖掘本身的技术。为了展示MCRapper的实际用途,我们利用它开发了一种算法TFP- r,用于真频繁模式(TFP)挖掘任务,通过适当计算感兴趣模式集合的负边界和正边界的近似,从而允许对模式空间进行有效修剪和计算最大偏差的强界。TFP-R可以保证包含任何误报的概率(精度),并且比提供相同保证的现有方法显示出更高的统计能力(召回率)。我们评估了MCRapper和TFP-R,并表明它们在各自的任务中表现优于最先进的技术。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Machine Learning-based Short-term Rainfall Prediction from Sky Data Incremental Feature Spaces Learning with Label Scarcity Multi-objective Learning to Overcome Catastrophic Forgetting in Time-series Applications Combining Filtering and Cross-Correlation Efficiently for Streaming Time Series Segment-Wise Time-Varying Dynamic Bayesian Network with Graph Regularization
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1