Real-Time Clustering for Large Sparse Online Visitor Data

G. Chan, F. Du, Ryan A. Rossi, Anup B. Rao, Eunyee Koh, Cláudio T. Silva, J. Freire
Published in: Proceedings of The Web Conference 2020 (2020-04-20)
DOI: 10.1145/3366423.3380183 (https://doi.org/10.1145/3366423.3380183)
Citations: 4

Abstract

Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behaviors. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to respond in real time to user parameter changes, which current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incurs a quadratic computation overhead and is infeasible for any moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation than a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20× speedup in the end-to-end clustering pipeline, and (3) keeps computations within a small memory footprint. Finally, we present an interface to explore customer segments from millions of online visitor records in real time.
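
For readers unfamiliar with the building blocks named in the abstract, the following is a minimal sketch of textbook MinHash with LSH banding, which avoids comparing every pair of visitors. It is the "straightforward" baseline idea, not the authors' sparse density peaks method or their Spark implementation. All names (`make_hash_params`, `minhash_signature`, `estimate_jaccard`, `lsh_candidate_pairs`) and the parameter choices (128 hash functions, 32 bands) are assumptions made for illustration; each visitor is assumed to be a non-empty set of non-negative integer column indices.

```python
import random
from collections import defaultdict
from itertools import combinations

# Illustrative sketch only: plain MinHash + LSH banding, not the paper's algorithm.
PRIME = (1 << 61) - 1  # large prime modulus for the hash family h(x) = (a*x + b) mod p


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for num_hashes hash functions (hypothetical helper)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Signature = minimum hash value of the set under each hash function."""
    return tuple(min((a * x + b) % PRIME for x in items) for a, b in params)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def lsh_candidate_pairs(signatures, bands):
    """Split each signature into bands; visitors sharing any band become candidates."""
    num_hashes = len(next(iter(signatures.values())))
    rows = num_hashes // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[sig[b * rows:(b + 1) * rows]].append(key)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


# Toy data: each visitor is the set of column indices with nonzero behavior.
visitors = {
    "v1": {1, 5, 9, 42},
    "v2": {1, 5, 9, 42},
    "v3": {1000, 2000, 3000},
}
params = make_hash_params(num_hashes=128)
signatures = {k: minhash_signature(v, params) for k, v in visitors.items()}
pairs = lsh_candidate_pairs(signatures, bands=32)
print(pairs)
```

Only the candidate pairs returned by `lsh_candidate_pairs` would need a distance estimate, which is how this family of techniques sidesteps the quadratic pairwise cost described above. More bands with fewer rows each make the filter more permissive, while fewer bands with more rows make it stricter.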