
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining: Latest Publications

New algorithms for parking demand management and a city-scale deployment
O. Zoeter, C. Dance, S. Clinchant, J. Andreoli
On-street parking, like any publicly owned utility, is used inefficiently if access is free or priced very far from market rates. This paper introduces a novel demand management solution: using data from dedicated occupancy sensors, an iteration scheme updates parking rates to better match demand. The new rates encourage parkers to avoid peak hours and peak locations, reducing congestion and underuse. The solution is deliberately simple, so that it is easy to understand, easily seen to be fair, and leads to parking policies that are easy to remember and act upon. We study the convergence properties of the iteration scheme and prove that it converges to a reasonable distribution for a very large class of models. The algorithm has been in use to set parking rates for over 6,000 spaces in downtown Los Angeles since June 2012 as part of the LA Express Park project. Initial results are encouraging, with reduced congestion and underuse, and with rates decreased in more locations than they were increased.
Citations: 20
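The abstract of the parking paper does not spell out the iteration scheme itself; as a minimal sketch of a demand-responsive rate update of this general kind (the target occupancy band, step size, and block names below are illustrative assumptions, not the paper's values):

```python
def update_rates(rates, occupancies, low=0.6, high=0.8, step=0.5):
    """One iteration of a simple demand-responsive rate update.

    rates: dict of block -> hourly rate in dollars (hypothetical data)
    occupancies: dict of block -> occupancy fraction observed by sensors
    """
    new_rates = {}
    for block, rate in rates.items():
        occ = occupancies[block]
        if occ > high:         # congested: raise the rate to shift demand away
            new_rates[block] = rate + step
        elif occ < low:        # underused: lower the rate (never below zero)
            new_rates[block] = max(rate - step, 0.0)
        else:                  # within the target band: keep the rate
            new_rates[block] = rate
    return new_rates
```

Repeating such an update across rate periods nudges prices until occupancy settles inside the target band, which mirrors the paper's stated goal of reducing both congestion and underuse.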
Data science through the lens of social science
D. Conway
In this talk, Drew will examine data science through the lens of the social scientist. He will discuss how the various skills and disciplines combine into data science. Drew will also present a motivating example directly from his work as a senior advisor to NYC's Mayor's Office of Analytics.
Citations: 3
Personalized search result diversification via structured learning
Shangsong Liang, Z. Ren, M. de Rijke
This paper is concerned with the problem of personalized diversification of search results, with the goal of enhancing the performance of both plain diversification and plain personalization algorithms. In previous work, the problem has mainly been tackled by means of unsupervised learning. To further enhance the performance, we propose a supervised learning strategy. Specifically, we set up a structured learning framework for conducting supervised personalized diversification, in which we add features extracted directly from the tokens of documents, features utilized by unsupervised personalized diversification algorithms, and, importantly, features generated from our proposed user-interest latent Dirichlet topic model. Based on this topic model, our learning strategy can estimate whether a document caters to a user's interest. We also define two constraints in our structured learning framework to ensure that search results are both diversified and consistent with a user's interest. We conduct experiments on an open personalized diversification dataset and find that our supervised learning strategy outperforms unsupervised personalized diversification methods as well as other plain personalization and plain diversification methods.
Citations: 43
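The structured learning framework of the diversification paper cannot be reconstructed from its abstract, but its objective (results that are simultaneously relevant, matched to the user's interest, and non-redundant) can be illustrated with a much simpler greedy MMR-style reranker. This is a stand-in sketch, not the authors' method; the weights `lam` and `mu` and all score names are assumptions:

```python
def personalized_diverse_rerank(docs, relevance, affinity, sim, k=3, lam=0.5, mu=0.3):
    """Greedily pick k docs trading off query relevance, user interest
    (affinity), and novelty with respect to already selected docs.

    relevance/affinity: dict doc -> score; sim(a, b) -> similarity in [0, 1].
    """
    selected = []
    remaining = list(docs)
    while remaining and len(selected) < k:
        def score(d):
            # redundancy = similarity to the most similar already-picked doc
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * relevance[d] + mu * affinity[d] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The paper replaces hand-set weights like these with weights learned in a structured SVM-style framework, which is the core of its contribution.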
Gradient boosted feature selection
Z. Xu, Gao Huang, Kilian Q. Weinberger, A. Zheng
A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; and allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straightforward to implement, as it is based on a modification of Gradient Boosted Trees. We evaluate GBFS on several real-world data sets and show that it matches or outperforms other state-of-the-art feature selection algorithms. Yet it scales to larger data set sizes and naturally allows for domain-specific side information.
Citations: 150
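The idea behind GBFS can be illustrated with boosted one-split "stumps" where a feature is charged a penalty the first time it is used, so boosting prefers to reuse already-selected features. This is only loosely in the spirit of the paper; the stump learner, median threshold, and penalty value are assumptions of this sketch, not GBFS itself:

```python
import numpy as np

def boosted_feature_selection(X, y, n_rounds=20, lr=0.1, penalty=0.5):
    """Select features by boosting stumps on squared loss; using a feature
    for the first time costs `penalty`, which encourages a sparse feature set."""
    n, d = X.shape
    pred = np.zeros(n)
    used = set()
    for _ in range(n_rounds):
        resid = y - pred
        best = None
        for j in range(d):
            thr = np.median(X[:, j])
            mask = X[:, j] <= thr
            left = resid[mask].mean() if mask.any() else 0.0
            right = resid[~mask].mean() if (~mask).any() else 0.0
            fit = np.where(mask, left, right)
            # squared loss of this stump, plus a cost for introducing feature j
            loss = np.mean((resid - fit) ** 2) + (penalty if j not in used else 0.0)
            if best is None or loss < best[0]:
                best = (loss, j, thr, left, right)
        _, j, thr, left, right = best
        used.add(j)
        pred += lr * np.where(X[:, j] <= thr, left, right)
    return sorted(used)
```

In the paper this trade-off is built into the gradient boosting objective itself rather than handled by a fixed per-feature charge.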
Efficient mini-batch training for stochastic optimization
Mu Li, T. Zhang, Yuqiang Chen, Alex Smola
Stochastic gradient descent (SGD) is a popular technique for large-scale optimization problems in machine learning. In order to parallelize SGD, minibatch training needs to be employed to reduce the communication cost. However, an increase in minibatch size typically decreases the rate of convergence. This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch. We prove that the convergence rate does not decrease with increasing minibatch size. Experiments demonstrate that with suitable implementations of approximate optimization, the resulting algorithm can outperform standard SGD in many scenarios.
Citations: 700
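The key idea in the mini-batch paper is that, instead of taking one gradient step per minibatch, one approximately minimizes the minibatch loss plus a conservative proximal term (gamma/2)*||w - w_t||^2 anchored at the current iterate. A sketch of one such outer update for least-squares loss (the inner solver, step sizes, and gamma value are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

def minibatch_step(w, X_batch, y_batch, gamma=1.0, inner_steps=5, lr=0.1):
    """One outer update: a few gradient steps on the conservatively
    regularized minibatch subproblem  f_B(w) + (gamma/2)*||w - w_t||^2."""
    w_t = w.copy()                                   # anchor of the proximal term
    for _ in range(inner_steps):
        residual = X_batch @ w - y_batch
        grad = X_batch.T @ residual / len(y_batch)   # minibatch least-squares gradient
        grad = grad + gamma * (w - w_t)              # conservative regularization
        w = w - lr * grad
    return w
```

The proximal term keeps each update close to the previous iterate, which is what allows the paper to prove that the convergence rate does not degrade as the minibatch grows.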
Scaling out big data missing value imputations: pythia vs. godzilla
C. Anagnostopoulos, P. Triantafillou
Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine ('Godzilla') which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla with a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses a much smaller partition of the original dataset? If so, for obvious performance reasons it would be preferable to access only a subset of all cohorts per imputation. In that case, can we decide swiftly which subset of cohorts to engage per imputation? But efficiency and scalability are just one key concern: is it possible to do the above while ensuring imputation estimation errors comparable to, or even better than, Godzilla's? In this paper we answer these fundamental questions and develop principled methods and a framework which offer large performance speed-ups and errors better than, or comparable to, Godzilla's, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms that answer the above questions and engage the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms that can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcases our efficiency, scalability, and accuracy claims.
Citations: 21
Identifying tourists from public transport commuters
Mingqiang Xue, Huayu Wu, Wei Chen, W. Ng, Gin Howe Goh
The tourism industry has become a key economic driver for Singapore. Understanding the behaviors of tourists is very important for the government and for private-sector businesses, e.g., restaurants, hotels, and advertising companies, to improve their existing services or create new business opportunities. In this joint work with Singapore's Land Transport Authority (LTA), we innovatively apply machine learning techniques to identify tourists among public commuters using the public transportation data provided by LTA. On successful identification, the travelling patterns of tourists are revealed, allowing further analyses such as of their favorite destinations, region of stay, etc. Technically, we model tourist identification as a classification problem and design an iterative learning algorithm to perform inference with limited prior knowledge and labeled data. We show the superiority of our algorithm through performance evaluation and comparison with other state-of-the-art learning algorithms. Further, we build an interactive web-based system for answering queries regarding the moving patterns of tourists, which stakeholders can use to gain insight into tourists' travelling behaviors in Singapore.
Citations: 20
Frontiers in E-commerce personalization
S. Subramaniam
E-commerce has largely been a 'Pull' model to date. Offline retailers have nailed discovery, delight, serendipity, and impulse purchases in person with greater success than online commerce sites. However, in an always-on, mobile-first world, companies like Groupon have the opportunity to push the frontier even further than offline retailers or comprehensive sites, because our smartphones are always with us. The challenge is to provide the right deals to the right user at the right time. That involves learning about users, their locations, and their personal preferences; predicting which deals are likely to delight them; presenting diversity, discovery, and engaging UX to gather user preferences; and applying semantic graph approaches to user-deal matching. This presentation will give insight into how Groupon grapples with these challenges via a data-driven system in order to delight and surprise customers.
Citations: 6
Relevant overlapping subspace clusters on categorical data
Xiao He, Jing Feng, B. Konte, S. T. Mai, C. Plant
Clustering categorical data poses some unique challenges: due to the missing order and spacing among categories, selecting a suitable similarity measure is a difficult task. Many existing techniques require the user to specify input parameters which are difficult to estimate. Moreover, many techniques are limited to detecting clusters in the full-dimensional data space. Only a few methods exist for subspace clustering, and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compressing the data. Optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is especially designed to detect subspace clusters on categorical data which may overlap in objects and/or attributes; i.e., objects can be assigned to different clusters in different subspaces, and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
Citations: 9
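ROCAT's core premise, that a clustering is good if it compresses the data, can be sketched with a toy code-length function for categorical rows: data bits come from per-cluster attribute value frequencies, and a crude fixed charge per frequency table stands in for the model cost. This is a simplified illustration of the MDL idea, not ROCAT's actual scoring:

```python
import math
from collections import Counter

def description_length(rows, clusters, model_bits_per_table=1.0):
    """Toy MDL score for a categorical clustering; lower is better.

    rows: list of tuples of categorical values.
    clusters: list assigning a cluster id to each row.
    """
    total = 0.0
    n_attrs = len(rows[0])
    for c in set(clusters):
        members = [r for r, k in zip(rows, clusters) if k == c]
        m = len(members)
        for a in range(n_attrs):
            counts = Counter(r[a] for r in members)
            # data bits: entropy coding of this attribute within the cluster
            total += sum(-n_v * math.log2(n_v / m) for n_v in counts.values())
            total += model_bits_per_table   # cost of storing the frequency table
    return total
```

Splitting the data into clusters with homogeneous attribute values drives the data bits toward zero while only adding small model costs, which is exactly the trade-off an MDL-based method optimizes.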
A Dirichlet multinomial mixture model-based approach for short text clustering
Jianhua Yin, Jianyong Wang
Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbreviated GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge. GSDMM can also cope with the sparse and high-dimensional nature of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM achieves significantly better performance than three other clustering models.
{"title":"A dirichlet multinomial mixture model-based approach for short text clustering","authors":"Jianhua Yin, Jianyong Wang","doi":"10.1145/2623330.2623715","DOIUrl":"https://doi.org/10.1145/2623330.2623715","url":null,"abstract":"Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr. to GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge. GSDMM can also cope with the sparse and high-dimensional problem of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM can achieve significantly better performance than three other clustering models.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87623284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 443