In this paper, we present an end-to-end data mining system that collects application-related data and provides relevant fields analysis in addition to search and filtering. We present details on field extraction, indexing, relevant field processing, and dynamic baseline derivation. We also propose to demonstrate the effectiveness of various scoring algorithms. Two real-world use cases show that relevant fields analysis is effective at detecting application anomalies and discovering the root causes of application incidents.
{"title":"Discovering Anomalies and Root Causes in Applications via Relevant Fields Analysis","authors":"Yuchen Zhao, Arjun Iyer, Ariel Smoliar","doi":"10.1109/ICDMW.2015.68","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.68","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"18 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132810810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Social Networks (OSNs) are widely adopted in our daily lives, and it is common for one individual to register with multiple sites for different services. Linking the rich contents of different social network sites is valuable to researchers for understanding human behaviors from different perspectives. For instance, each OSN has its own group of users and thus its own biases. Linked accounts can serve as a good calibration dataset to improve data quality. This Entity Resolution (ER) problem is a challenge in the social network domain that many researchers attempt to tackle. In this paper, we take advantage of spatial information posted on different social network sites and propose an efficient multiresolution mutual information approach to link the entities from those sites. The proposed method significantly reduces computing time by using an iterative coarse-to-fine multiresolution approach, yet is robust to the sparsity of location data. Human location behavior is also discussed when deciding the resolution level. Publicly available Twitter and Instagram data collected from their APIs are used to illustrate the method, and the performance is evaluated by comparing it with a greedy mutual information approach.
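The coarse-to-fine idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: locations are stored per time bin, the grid is a plain latitude/longitude grid, the score is the empirical mutual information of the two accounts' cell labels, and the names `to_cell`, `score`, `link`, `resolutions`, and `keep` are hypothetical.

```python
import math
from collections import Counter


def to_cell(lat, lon, cell_deg):
    """Map a coordinate to a square grid cell of side cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def mutual_information(pairs):
    """Empirical mutual information, in nats, of a list of (x, y) labels."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def score(track_a, track_b, cell_deg):
    """MI between two accounts' grid cells over the time bins they share."""
    shared = sorted(track_a.keys() & track_b.keys())
    pairs = [
        (to_cell(*track_a[t], cell_deg), to_cell(*track_b[t], cell_deg))
        for t in shared
    ]
    return mutual_information(pairs)


def link(track_a, candidates, resolutions=(1.0, 0.1), keep=2):
    """Coarse to fine: rank at each resolution, keep the best few, refine."""
    pool = list(candidates)
    for cell_deg in resolutions:
        pool.sort(key=lambda name: score(track_a, candidates[name], cell_deg),
                  reverse=True)
        pool = pool[:keep]  # cheap coarse pass prunes the candidate set
    return pool[0]


# Toy data (made up): time bin -> (lat, lon).
twitter = {0: (40.71, -74.00), 1: (34.05, -118.24),
           2: (40.71, -74.00), 3: (34.05, -118.24)}
instagram = {
    "same_person": {0: (40.72, -74.01), 1: (34.06, -118.25),
                    2: (40.72, -74.01), 3: (34.06, -118.25)},
    "stay_at_home": {0: (41.88, -87.63), 1: (41.88, -87.63),
                     2: (41.88, -87.63), 3: (41.88, -87.63)},
}
best = link(twitter, instagram)
```

In this toy setup the coarse pass is where the saving would come from: with many candidates, only the `keep` best survive to be rescored on the finer, more expensive grid.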
{"title":"Multiresolution Mutual Information Method for Social Network Entity Resolution","authors":"Cong Shi, Rong Duan","doi":"10.1109/ICDMW.2015.94","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.94","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133943045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaoming Li, Bin Wu, Qian Guo, Xuelin Zeng, C. Shi
Dynamic community detection algorithms identify communities in a dynamic network, which consists of a series of network snapshots. To address this problem, we propose a new dynamic community detection algorithm based on incremental identification according to a vertex-based metric called permanence. We incrementally analyze the community ownership of only some of the vertices, so as to avoid reassigning all the vertices in the network to their respective communities. In addition, we propose a new metric called evolution strength to measure the error that may be caused by incrementally assigning community ownership or by abrupt changes in network structure. The experimental results show that our proposed algorithm identifies the community structure in a network with higher efficiency. Meanwhile, because dynamic network data with ground-truth structure are lacking and existing synthetic methods are limited, we propose a novel method for generating synthetic dynamic network data with ground-truth structure, which defines evolution events and the evolution rate of events, so as to obtain more realistic synthetic data.
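For readers unfamiliar with permanence, the sketch below computes it for a single vertex, using the definition as it is commonly stated in the community detection literature: Perm(v) = (I(v) / E_max(v)) / D(v) - (1 - c_in(v)), where I(v) counts internal neighbours, E_max(v) is the largest number of links to any one external community, D(v) is the degree, and c_in(v) is the clustering coefficient among internal neighbours. The abstract does not spell out the formula, so treat this as an assumption; the edge-case conventions (E_max = 1 with no external links, c_in = 0 with fewer than two internal neighbours) and the helper `vertices_to_recheck` are likewise illustrative choices.

```python
def permanence(graph, community, v):
    """Permanence of vertex v.

    graph: dict mapping each vertex to the set of its neighbours.
    community: dict mapping each vertex to its community label.
    """
    neighbours = graph[v]
    degree = len(neighbours)
    if degree == 0:
        return 0.0

    internal = {u for u in neighbours if community[u] == community[v]}

    # Count links from v into each external community.
    external_counts = {}
    for u in neighbours - internal:
        label = community[u]
        external_counts[label] = external_counts.get(label, 0) + 1
    # Assumed convention: E_max is 1 when v has no external links.
    e_max = max(external_counts.values(), default=1)

    # Clustering coefficient among the internal neighbours of v.
    k = len(internal)
    if k < 2:
        c_in = 0.0  # assumed convention for too few internal neighbours
    else:
        links = sum(1 for a in internal for b in internal
                    if a < b and b in graph[a])
        c_in = links / (k * (k - 1) / 2)

    return (len(internal) / e_max) / degree - (1.0 - c_in)


def vertices_to_recheck(changed_edges):
    """Hypothetical helper: endpoints of edges that changed between snapshots."""
    return {u for edge in changed_edges for u in edge}


# Toy snapshot: a triangle in community A plus one vertex in community B.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
community = {0: "A", 1: "A", 2: "A", 3: "B"}
scores = {v: permanence(graph, community, v) for v in graph}
```

An incremental scheme in this spirit would recompute permanence only for the vertices returned by `vertices_to_recheck`, plus whatever neighbourhood the algorithm decides is affected, instead of reassigning every vertex in each snapshot.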
{"title":"Dynamic Community Detection Algorithm Based on Incremental Identification","authors":"Xiaoming Li, Bin Wu, Qian Guo, Xuelin Zeng, C. Shi","doi":"10.1109/ICDMW.2015.158","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.158","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"55 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132117065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ATMs aim to extend essential banking services such as cash withdrawal and deposit beyond the working hours of a bank's branch. However, ATMs incur a significant cost overhead in the form of capital and operational costs. The problem of ATM location is further complicated because customers of one bank can use their debit cards at any other bank's ATMs. While this might attract charges, some banks refund these charges to attract customers. Banks need a mechanism to quantitatively measure the benefits of managing their own ATMs versus paying for services rendered to their customers by other banks through those banks' ATMs. Game theory is the study of strategic decision making and is an effective technique for identifying the best business strategy when multiple options are available. In this paper, we propose a game theoretic model based on stochastic games to identify the best strategy for banks to adopt for their ATM expansion. We further propose an algorithm to identify the ideal locations for a bank to place an ATM if the result of the ATM game recommends that the bank establish its own ATM.
{"title":"A Stochastic Game Theoretic Model for Expanding ATM Services","authors":"Raja Rathnam Naidu Kanapaka, Raghu Neelisetti","doi":"10.1109/ICDMW.2015.125","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.125","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134502500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile devices have become increasingly prevalent in recent years, especially among young people. The rapid progress of mobile devices promotes the development of M-Commerce. Purchases on mobile terminals account for a considerable percentage of the total trading volume of E-Commerce and have begun to draw the attention of E-Commerce corporations. Alibaba held a Mobile Recommendation Algorithm Competition aiming to recommend appropriate items to mobile users at the right time and place. The dataset provided by Alibaba consists of about 6 billion operation logs made by 5 million Taobao users on over 150 million items, spanning a period of one month. Compared with traditional purchase prediction scenarios, the competition raised three challenges: (1) the dataset is too large to be processed on personal computers, (2) some days with large discounts offered by Taobao Marketplace fall within the period of the dataset, and (3) positive samples are too few compared to the dimension of the features. In this paper, we study the problem of predicting the purchase behaviour of M-Commerce users by exploring a solution for Alibaba's Mobile Recommendation Algorithm Competition. We first study customer habits in depth and filter out many outliers. We then adopt a "sliding window" method to supply more positive samples for the training dataset and to smooth the burst of sales near Dec 12th. We design a feature engineering framework to extract 6 categories of features that aim to capture the buying potential of user-item pairs. Our features exploit the interaction of the user-item pair, the user's shopping habits, and the item's attraction for users. We then apply Gradient Boosting Decision Trees (GBDT) as the training model. Finally, we combine the outputs of individual GBDT models by Logistic Regression to obtain the final predictions. Our solution achieves an F1 score of 8.66% and ranks third in the final round.
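One way to read the "sliding window" step is that every day in a range is used in turn as the label day, with the preceding days supplying the features, so that a single month of logs yields several training sets and therefore more positive samples. The sketch below follows that reading. The log layout `(user, item, action, day)`, the integer day index, the single count feature, and the function name `sliding_window_samples` are assumptions for illustration; the paper's actual features and window sizes are not given in the abstract.

```python
def sliding_window_samples(logs, label_days, window=7):
    """Build one (label_day, user, item, n_actions, label) row per pair.

    logs: iterable of (user, item, action, day) with integer day indices.
    For each label day, candidate pairs are those seen in the `window`
    days before it; the label is 1 if the pair has a "buy" on that day.
    """
    logs = list(logs)
    samples = []
    for label_day in label_days:
        # Feature: number of actions on the pair inside the window.
        counts = {}
        for user, item, _action, day in logs:
            if label_day - window <= day < label_day:
                counts[(user, item)] = counts.get((user, item), 0) + 1
        # Label: purchases that happen on the label day itself.
        bought = {(u, i) for u, i, action, day in logs
                  if action == "buy" and day == label_day}
        for (user, item), n_actions in sorted(counts.items()):
            label = int((user, item) in bought)
            samples.append((label_day, user, item, n_actions, label))
    return samples


# Toy logs (made up).
logs = [
    ("u1", "i1", "click", 1),
    ("u1", "i1", "buy", 2),
    ("u2", "i2", "click", 2),
    ("u2", "i2", "buy", 3),
]
samples = sliding_window_samples(logs, label_days=[2, 3], window=2)
positives = sum(row[-1] for row in samples)
```

With a single label day (day 3) the toy logs would give one positive row; sliding the window over days 2 and 3 gives two, which is the effect the abstract describes.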
{"title":"Multi-Classes Feature Engineering with Sliding Window for Purchase Prediction in Mobile Commerce","authors":"Qiang Li, Maojie Gu, Keren Zhou, Xiaoming Sun","doi":"10.1109/ICDMW.2015.172","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.172","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134567183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhaojing Zhang, R. Wang, Weihong Zheng, Shizhan Lan, D. Liang, Hao Jin
Confronted with fierce competition, an increasing number of telecommunication companies in China realize that they can increase profits by reducing the rate of customer churn rather than by attracting the same number of new customers. Recently, the availability of big data has increased, which has stimulated the development of data mining techniques. For operators, identifying methods that use big data to maximize profits is vital. As a novel contribution, this paper studies three key factors of the customer churn problem, namely churn rate, prediction performance, and retention capability. We propose a profit function that maximizes profits under different conditions and obtain favorable results when applying it to sample data from China Mobile Communications Corporation. Theoretically, about 7.72 million Chinese Yuan per month can be obtained by applying the proposed model to China Mobile Group Guangxi Company Limited, which gives our research considerable economic value.
{"title":"Profit Maximization Analysis Based on Data Mining and the Exponential Retention Model Assumption with Respect to Customer Churn Problems","authors":"Zhaojing Zhang, R. Wang, Weihong Zheng, Shizhan Lan, D. Liang, Hao Jin","doi":"10.1109/ICDMW.2015.84","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.84","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134602293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-world social networks contain relationships of multiple different types, but this richness is often ignored in graph-theoretic modelling. We show how two recently developed spectral embedding techniques, for directed graphs (relationships are asymmetric) and for signed graphs (relationships are both positive and negative), can be combined. This combination is particularly appropriate for intelligence, terrorism, and law-enforcement applications. We illustrate by applying the novel embedding technique to datasets describing conflict in North-West Africa, and show how unusual interactions can be identified.
{"title":"Signed Directed Social Network Analysis Applied to Group Conflict","authors":"Q. Zheng, D. Skillicorn, O. Walther","doi":"10.1109/ICDMW.2015.107","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.107","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134066326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the growth of user-generated content (UGC), it is important to learn quickly what consumers think about the features or deficiencies of products. Such information is important not only for companies but also for consumers. Keyword-based visualization and clustering are effective methods for viewing a summary of opinions. To reduce users' effort in examining vast amounts of UGC, we propose an interactive visualization system that presents sentiment words with aspects, based on natural language processing and a sentiment lexicon. This paper also proposes applying latent Dirichlet allocation (LDA) to cluster reviews into several topics in order to make the visualization easier to understand. This paper explains the developed system with case studies.
{"title":"Proposal of LDA-Based Sentiment Visualization of Hotel Reviews","authors":"Yu-Sheng Chen, Lieu-Hen Chen, Y. Takama","doi":"10.1109/ICDMW.2015.72","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.72","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115618365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searching for desired events in uncontrolled videos is a challenging task. Current research mainly focuses on obtaining concepts from numerous labeled videos. However, it is time-consuming and labor-intensive to collect the large number of labeled videos required to model events under various circumstances. To alleviate the labeling burden, we propose to learn models for videos by leveraging abundant Web images, which are a rich source of information, with many events captured under various conditions and roughly annotated. However, knowledge from the Web is noisy and diverse, and brute-force knowledge transfer may hurt retrieval performance. To address this negative transfer problem, we propose a novel Joint Group Weighting Learning (JGWL) framework to transfer different but related groups of knowledge (source domain), queried from a Web image search engine, to real-world videos (target domain). Under this framework, the weights of different groups are learned in a joint optimization, and each weight represents how much the corresponding image group contributes to the knowledge transferred to the videos. Moreover, to deal with the feature distribution mismatch between the video feature space and the image feature space, we build a common feature subspace that bridges these two heterogeneous feature spaces in an unsupervised manner. Experimental results on two challenging video datasets demonstrate that it is effective to use grouped knowledge gained from Web images for video retrieval.
{"title":"Finding Event Videos via Image Search Engine","authors":"Han Wang, Xinxiao Wu","doi":"10.1109/ICDMW.2015.78","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.78","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115105087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Top-k queries are widely studied for identifying a ranked set of the k most interesting objects based on individual user preferences. Reverse top-k queries take the perspective of the product manufacturer and are essential for manufacturers to assess the potential market and impact of their products. However, existing approaches for reverse top-k queries all assume that the underlying data are exact. Because of the intrinsic differences between uncertain and certain data, these methods are designed only for certain databases and cannot be applied directly to the uncertain case. Motivated by this, in this paper we first model probabilistic reverse top-k queries in the context of uncertain data. Moreover, we formulate the challenging problem of processing queries that report the l most favorite objects to users, where the impact factor of an object is defined as the cardinality of its probabilistic reverse top-k query result set. To speed up the query, we exploit several properties of probabilistic threshold top-k queries and probabilistic skyline queries to reduce the solution space of this problem. In addition, an upper bound on the number of potential users is estimated to reduce the cost of computing the probabilistic reverse top-k queries for the candidate objects. Furthermore, effective pruning heuristics are presented to further reduce the search space of query processing. Finally, efficient query algorithms that integrate the proposed pruning strategies are presented. Extensive experiments demonstrate the efficiency and effectiveness of our proposed algorithms under various experimental settings.
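To make the query definitions concrete, here is a brute-force sketch on exact (certain) data only. It is a simplification, not the probabilistic algorithm of the paper: each user is a weight vector, a product's score is the weighted sum of its attributes, higher scores are assumed to be better, ties are broken by product name, and the impact factor is the size of the reverse top-k result set. The function names and toy data are hypothetical.

```python
def top_k(products, weights, k):
    """Names of the k best products for one user's weight vector."""
    scores = {name: sum(w * a for w, a in zip(weights, attrs))
              for name, attrs in products.items()}
    # Higher score first (an assumption); ties broken by name.
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


def reverse_top_k(target, products, users, k):
    """Users whose top-k list contains the target product."""
    return {user for user, weights in users.items()
            if target in top_k(products, weights, k)}


def most_favorite(products, users, k, l):
    """The l products with the largest impact factor (reverse top-k size)."""
    impact = {name: len(reverse_top_k(name, products, users, k))
              for name in products}
    return sorted(impact, key=lambda name: (-impact[name], name))[:l]


# Toy data (made up): two attributes per product, two weights per user.
products = {"a": (9, 2), "b": (1, 9), "c": (6, 6)}
users = {"u1": (1.0, 0.0), "u2": (0.0, 1.0), "u3": (0.5, 0.5)}
favorites = most_favorite(products, users, k=2, l=2)
```

In the uncertain setting described in the abstract, the membership test inside `reverse_top_k` would become a probability compared against a threshold, and the pruning strategies exist precisely to avoid this exhaustive product-by-user evaluation.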
{"title":"Reporting L Most Favorite Objects in Uncertain Databases with Probabilistic Reverse Top-k Queries","authors":"Guoqing Xiao, Kenli Li, Keqin Li","doi":"10.1109/ICDMW.2015.47","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.47","url":null,"PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115286707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}