Latest papers from the 2015 IEEE International Conference on Data Mining Workshop (ICDMW)


Discovering Anomalies and Root Causes in Applications via Relevant Fields Analysis

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.68

Yuchen Zhao, Arjun Iyer, Ariel Smoliar

In this paper, we present a powerful end-to-end data mining system that collects application-related data and provides insightful relevant fields analysis in addition to search and filtering. We present details on field extraction, indexing, relevant field processing and dynamic baseline derivation. We also propose to demonstrate the effectiveness of various scoring algorithms. Two real-world use cases show that relevant fields analysis is effective at detecting application anomalies and discovering root causes of application incidents.


Citations: 1

Multiresolution Mutual Information Method for Social Network Entity Resolution

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.94

Cong Shi, Rong Duan

Online Social Networks (OSNs) are widely adopted in our daily lives, and it is common for one individual to register with multiple sites for different services. Linking the rich content of different social network sites is valuable to researchers for understanding human behavior from different perspectives. For instance, each OSN has its own group of users and thus its own biases, so linked accounts can serve as a good calibration dataset to improve data quality. This Entity Resolution (ER) problem is a challenge in the social network domain that many researchers attempt to tackle. In this paper we take advantage of spatial information posted on different social network sites and propose an efficient multiresolution mutual information approach to link the entities from those sites. The proposed method significantly reduces computing time by using an iterative coarse-to-fine multiresolution approach, yet is robust to the sparsity of location data. Human location-wise behavior is also discussed in deciding the resolution level. Publicly available Twitter and Instagram data collected from their APIs are used to illustrate the method, and its performance is evaluated by comparison with a greedy mutual information approach.
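The coarse-to-fine idea in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: the abstract gives no formulas, so the grid cell sizes, the time-slot alignment of the two accounts' posts, the early-stop rule, and the function names `to_cell`, `mutual_information` and `coarse_to_fine_score` are all assumptions made for illustration. The sketch discretizes coordinates into grid cells, computes the empirical mutual information of the paired cells, and only moves to a finer grid while the coarser score stays above a threshold.

```python
from collections import Counter
from math import floor, log2


def to_cell(lat, lon, cell_deg):
    """Map a coordinate to a grid cell index at the given cell size (degrees)."""
    return (floor(lat / cell_deg), floor(lon / cell_deg))


def mutual_information(pairs):
    """Empirical mutual information, in bits, of a list of (x, y) observations."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # I(X;Y) = sum p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def coarse_to_fine_score(points_a, points_b,
                         resolutions=(1.0, 0.1, 0.01), min_mi=0.0):
    """Score two accounts whose (lat, lon) posts are aligned by time slot.

    Assumed rule: evaluate the coarsest grid first and stop refining as soon
    as the score is not above min_mi, so unlikely pairs are discarded cheaply.
    """
    score = 0.0
    for cell_deg in resolutions:
        pairs = [(to_cell(lat_a, lon_a, cell_deg), to_cell(lat_b, lon_b, cell_deg))
                 for (lat_a, lon_a), (lat_b, lon_b) in zip(points_a, points_b)]
        score = mutual_information(pairs)
        if score <= min_mi:
            break
    return score
```

Two accounts that alternate between the same two cities in step with each other score 1 bit at every resolution in this sketch, while an account that never moves scores 0 against any other and is dropped at the coarsest level.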


Citations: 1

Dynamic Community Detection Algorithm Based on Incremental Identification

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.158

Xiaoming Li, Bin Wu, Qian Guo, Xuelin Zeng, C. Shi

Dynamic community detection algorithms aim to identify communities in a dynamic network, which consists of a series of network snapshots. To address this problem, we propose a new dynamic community detection algorithm based on incremental identification, according to a vertex-based metric called permanence. We incrementally analyze the community ownership of a subset of vertices, so as to avoid reassigning all the vertices in the network to their respective communities. In addition, we propose a new metric called evolution strength to measure the error that may be caused by incrementally assigning community ownership or by abrupt changes in network structure. The experimental results show that our proposed algorithm identifies the community structure in a network with higher efficiency. Meanwhile, because dynamic network data with ground-truth structure are lacking and existing synthetic methods are limited, we propose a novel method for generating synthetic dynamic network data with ground-truth structure, which defines evolution events and the evolution rate of events so as to produce more realistic synthetic data.


Citations: 19

A Stochastic Game Theoretic Model for Expanding ATM Services

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.125

Raja Rathnam Naidu Kanapaka, Raghu Neelisetti

ATMs aim to extend essential banking services such as cash withdrawal and deposit beyond the working hours of a bank's branch. However, ATMs incur a significant cost overhead in the form of capital and operational costs. The problem of ATM location is further complicated because customers of one bank can use their debit cards at any other bank's ATMs. While this might attract charges, some banks refund these charges to attract customers. Banks need a mechanism to quantitatively measure the benefits of managing their own ATM versus paying for services rendered to their customers by other banks through their ATMs. Game theory is the study of strategic decision making and is an effective technique for identifying the best business strategy when provided with multiple options. In this paper we propose a game theoretic model based on stochastic games to identify the best strategy to be adopted by banks for their ATM expansion. We further propose an algorithm to identify the idle locations where a bank should place an ATM, should the result of the ATM game recommend that the bank establish its own ATM.


Citations: 1

Multi-Classes Feature Engineering with Sliding Window for Purchase Prediction in Mobile Commerce

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.172

Qiang Li, Maojie Gu, Keren Zhou, Xiaoming Sun

Mobile devices have become more and more prevalent in recent years, especially among young people. The rapid progress of mobile devices promotes the development of M-Commerce business. Purchases on mobile terminals account for a considerable percentage of the total trading volume of E-Commerce and have begun to draw the attention of E-Commerce corporations. Alibaba held a Mobile Recommendation Algorithm Competition aiming to recommend appropriate items to mobile users at the right time and place. The dataset provided by Alibaba consists of about 6 billion operation logs made by 5 million Taobao users on over 150 million items, spanning a period of one month. Compared with traditional purchase prediction scenarios, the competition raised three challenges: (1) the dataset is too large to be processed on personal computers; (2) some days with large discounts offered by Taobao Marketplace fall within the period of the dataset; (3) positive samples are too few compared to the dimension of the features. In this paper we study the problem of predicting the purchase behaviour of M-Commerce users by exploring a solution for Alibaba's Mobile Recommendation Algorithm Competition. We first study the habits of customers in depth and filter out many outliers. After that we adopt a "sliding window" method to supply positive samples for the training dataset and to smooth the burst of sales near Dec 12th. We design a feature engineering framework to extract 6 categories of features that aim to capture the buying potential of user-item pairs. Our features exploit the interaction of the user-item pair, the user's shopping habits and the item's attraction for users. Then we apply Gradient Boosting Decision Trees (GBDT) as the training model. In the end, we combine the outputs of the individual GBDTs by Logistic Regression to get the final predictions. Our solution achieves an F1 score of 8.66% and ranks third in the final round.
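The "sliding window" step can be pictured as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the log layout `(user, item, day, action)`, the `"buy"` action label, and the function name `sliding_window_samples` are hypothetical. Each window of `feature_days` consecutive days supplies the candidate user-item pairs, and the day right after the window supplies the labels. Sliding the window forward one day at a time yields several labeled sets instead of one, which is one way to obtain more positive samples.

```python
def sliding_window_samples(logs, feature_days, first_day, last_day):
    """Build labeled (window_start, user, item, label) samples.

    logs: iterable of (user, item, day, action) with integer day numbers.
    A pair is a candidate if it appears in the feature window; it is a
    positive sample if the user buys the item on the day after the window.
    """
    logs = list(logs)
    samples = []
    # The label day (start + feature_days) must not run past last_day.
    for start in range(first_day, last_day - feature_days + 1):
        label_day = start + feature_days
        window_pairs = {(u, i) for u, i, d, _ in logs if start <= d < label_day}
        bought = {(u, i) for u, i, d, a in logs if d == label_day and a == "buy"}
        for user, item in sorted(window_pairs):
            samples.append((start, user, item, int((user, item) in bought)))
    return samples
```

With a four-day toy log and a two-day window, the function produces two windows and two positive samples, where a single fixed window would produce only one.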


Citations: 14

Profit Maximization Analysis Based on Data Mining and the Exponential Retention Model Assumption with Respect to Customer Churn Problems

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.84

Zhaojing Zhang, R. Wang, Weihong Zheng, Shizhan Lan, D. Liang, Hao Jin

Confronted with fierce competition, an increasing number of telecommunication companies in China realize that they can increase profits by reducing the rate of customer churn rather than attracting the same number of new customers. Recently, the availability of big data has increased, which has stimulated the development of data mining techniques. Identifying methods by which to maximize profits based on big data is vital for operators. This paper studies three key factors of the customer churn problem, namely, churn rate, prediction performance, and retention capability. We propose a profit function that maximizes profits under different conditions and obtain favorable results in applying it to sample data from China Mobile Communications Corporation. Theoretically, about 7.72 million Chinese Yuan per month can be obtained by applying the proposed model to China Mobile Group Guangxi Company Limited, making our research of great economic value.


Citations: 13

Signed Directed Social Network Analysis Applied to Group Conflict

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.107

Q. Zheng, D. Skillicorn, O. Walther

Real-world social networks contain relationships of multiple different types, but this richness is often ignored in graph-theoretic modelling. We show how two recently developed spectral embedding techniques, for directed graphs (relationships are asymmetric) and for signed graphs (relationships are both positive and negative), can be combined. This combination is particularly appropriate for intelligence, terrorism, and law-enforcement applications. We illustrate by applying the novel embedding technique to datasets describing conflict in North-West Africa, and show how unusual interactions can be identified.


Citations: 4

Estimating Taxi Demand-Supply Level Using Taxi Trajectory Data Stream

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.250

Dongxu Shao, Wei Wu, Shili Xiang, Yu Lu

Taxis provide a flexible and indispensable service to satisfy the urban travel demand of public commuters. Understanding taxi supply and commuter demand, especially the imbalance between the supply and the demand, would directly help to improve the quality of taxi service and eventually increase a city's traffic system efficiency. In this paper, we consider the taxi demand from a region during a period of time to include two parts: satisfied demand, i.e., passengers successfully receive taxi service during this period of time, and unmet demand, i.e., passengers are still waiting for taxi service. To properly estimate the demand-supply level (short for "the level of the taxi demand vs. supply imbalance"), we propose a novel indicator that reflects how fast an available taxi is taken in any given region. Accordingly, we design and implement a taxi analytics system to provide such information in near real time. Finally, we use the passenger waiting time survey data and the taxi streaming data to validate the proposed indicator on the built taxi analytics system.


Citations: 27

Reporting L Most Favorite Objects in Uncertain Databases with Probabilistic Reverse Top-k Queries

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.47

Guoqing Xiao, Kenli Li, Keqin Li

Top-k queries are widely studied for identifying a ranked set of the k most interesting objects based on individual user preferences. Reverse top-k queries are proposed from the perspective of the product manufacturer, and are essential for manufacturers to assess the potential market and impact of their products. However, the existing approaches for reverse top-k queries are all based on the assumption that the underlying data are exact. Due to the intrinsic differences between uncertain and certain data, these methods are designed only for certain databases and cannot be applied to the uncertain case directly. Motivated by this, in this paper we first model probabilistic reverse top-k queries in the context of uncertain data. Moreover, we formulate the challenging problem of processing queries that report the l most favorite objects to users, where the impact factor of an object is defined as the cardinality of the probabilistic reverse top-k query result set. To speed up the query, we exploit several properties of probabilistic threshold top-k queries and probabilistic skyline queries to reduce the solution space of this problem. In addition, an upper bound on the number of potential users is estimated to reduce the cost of computing the probabilistic reverse top-k queries for the candidate objects. Furthermore, effective pruning heuristics are presented to further reduce the search space of query processing. Finally, efficient query algorithms are presented that integrate the proposed pruning strategies. Extensive experiments demonstrate the efficiency and effectiveness of our proposed algorithms under various experimental settings.
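To make the query definitions concrete, here is a brute-force sketch of the exact-data case only. It does not model uncertainty and does not implement the paper's pruning strategies. Several details are assumptions made for illustration: users are represented by weight vectors, a product's score is the weighted sum of its attributes, a higher score is better, ties are broken by product name, and the function names are hypothetical. The impact factor of a product is the number of users whose top-k list contains it, and the query returns the l products with the largest impact factor.

```python
def top_k(products, weights, k):
    """Names of the k best products for one user's weight vector."""
    def score(attrs):
        return sum(w * a for w, a in zip(weights, attrs))
    # Sort by descending score; break ties by name for determinism.
    ranked = sorted(products, key=lambda name: (-score(products[name]), name))
    return ranked[:k]


def reverse_top_k(products, users, target, k):
    """Set of users whose top-k list contains the target product."""
    return {u for u, w in users.items() if target in top_k(products, w, k)}


def l_most_favorite(products, users, k, l):
    """The l products with the largest impact factor (reverse top-k size)."""
    impact = {p: len(reverse_top_k(products, users, p, k)) for p in products}
    return sorted(impact, key=lambda p: (-impact[p], p))[:l]
```

This sketch runs a full top-k computation for every product-user combination, which is exactly the cost that bounds and pruning heuristics of the kind the abstract mentions are meant to avoid.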


Citations: 7

The Hierarchical Model to Ali Mobile Recommendation Competition

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.75

Suchi Qian, Furong Peng, Xiang Li, Jianfeng Lu

Recommendation engines have gained the most attention in the Big Data world. In order to promote the application of big data, Alibaba Group organized the big data recommendation competition, which provides a big data processing platform and one billion behavior records to participants. The competition requires participants to learn a model from users' behaviors within one month and then predict purchase behavior on the following day. Four kinds of behaviors are included: browse, add-to-cart, collection and purchase. The F1-score is used as the metric to evaluate performance. Our team achieved the top score of 8.78%, and our success can be attributed to the following aspects. First, we model the recommendation problem as a binary classification problem and design a hierarchical model. Second, in order to improve the performance of a single classifier, we adopt a sample filtering strategy to select valuable samples for training, which not only boosts performance but also speeds up training. Third, a classifier fusion strategy is used to improve the final performance. This paper details our hierarchical model and some relevant key technologies adopted for this competition. This hierarchical model is also the framework of data processing, and is composed of four layers: 1) a sample filtering layer, which removes a large number of valueless samples and reduces the computing complexity; 2) a feature extraction layer, which extracts extensive features so as to characterize the samples from all possible views; 3) a classifying layer, which trains several classifiers with different sampling strategies and feature groups; 4) a fusion layer, which fuses the results of different classifiers to obtain a better one. Our score in the competition demonstrates the reasonableness and feasibility of our model.
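Since the abstract reports results as an F1-score, a short sketch of how that metric is commonly computed for a recommendation task may help. The representation is an assumption: predictions and ground truth are taken to be sets of (user, item) pairs, and the competition's exact scoring rules are not given in the source. F1 is the harmonic mean of precision (the share of predicted pairs that were actually bought) and recall (the share of actual purchases that were predicted).

```python
def f1_score(predicted, actual):
    """F1 of predicted (user, item) pairs against actual purchases."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    if hits == 0:
        # No correct prediction: precision and recall are both zero.
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)
```

Because the metric penalizes both over-predicting and under-predicting, the number of pairs submitted matters as much as their ranking, which is consistent with the emphasis the abstract places on sample filtering.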


Citations: 0
