Selecting an informative subset of features has important applications in many data mining tasks, especially for high-dimensional data. Recently, simultaneous selection of features and feature groups (a.k.a. bi-level selection) has become increasingly popular, since it not only reduces the number of features but also unveils the underlying grouping effect in the data, a valuable property in applications such as bioinformatics and web data mining. One major challenge of bi-level selection (or even of feature selection alone) is that computing a globally optimal solution requires prohibitive computational cost. To overcome this challenge, current research falls mainly into two categories. The first focuses on finding suitable continuous surrogates for the discrete objective, which leads to various convex and nonconvex optimization models. Although efficient, convex models usually deliver sub-optimal performance, while nonconvex models require significantly more computational effort. The other direction uses greedy algorithms to solve the discrete optimization directly. However, existing greedy algorithms handle only single-level selection, and extending them to bi-level selection remains challenging. In this paper, we fill this gap by introducing an efficient sparse group hard thresholding algorithm. Our main contributions are: (1) we propose a novel bi-level selection model and show that the key combinatorial problem admits a globally optimal solution via dynamic programming; (2) we provide an error bound between our solution and the globally optimal one under the RIP (Restricted Isometry Property) framework. Our experiments on synthetic and real data demonstrate that the proposed algorithm produces encouraging performance while keeping computational efficiency comparable to convex relaxation models.
{"title":"Simultaneous feature and feature group selection through hard thresholding","authors":"Shuo Xiang, Tao Yang, Jieping Ye","doi":"10.1145/2623330.2623662","DOIUrl":"https://doi.org/10.1145/2623330.2623662","url":null,"abstract":"Selecting an informative subset of features has important applications in many data mining tasks especially for high-dimensional data. Recently, simultaneous selection of features and feature groups (a.k.a bi-level selection) becomes increasingly popular since it not only reduces the number of features but also unveils the underlying grouping effect in the data, which is a valuable functionality in many applications such as bioinformatics and web data mining. One major challenge of bi-level selection (or even feature selection only) is that computing a globally optimal solution requires a prohibitive computational cost. To overcome such a challenge, current research mainly falls into two categories. The first one focuses on finding suitable continuous computational surrogates for the discrete functions and this leads to various convex and nonconvex optimization models. Although efficient, convex models usually deliver sub-optimal performance while nonconvex models on the other hand require significantly more computational effort. Another direction is to use greedy algorithms to solve the discrete optimization directly. However, existing algorithms are proposed to handle single-level selection only and it remains challenging to extend these methods to handle bi-level selection. In this paper, we fulfill this gap by introducing an efficient sparse group hard thresholding algorithm. Our main contributions are: (1) we propose a novel bi-level selection model and show that the key combinatorial problem admits a globally optimal solution using dynamic programming; (2) we provide an error bound between our solution and the globally optimal under the RIP (Restricted Isometry Property) theoretical framework. Our experiments on synthetic and real data demonstrate that the proposed algorithm produces encouraging performance while keeping comparable computational efficiency to convex relaxation models.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"226 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83437070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The frequency and intensity of natural disasters have increased significantly over the past decades, and this trend is predicted to continue. Facing such unexpected disasters, accurately predicting human emergency behavior and mobility becomes critical for planning effective humanitarian relief, disaster management, and long-term societal reconstruction. In this paper, we build a large human mobility database (GPS records of 1.6 million users over one year) together with several other datasets to capture and analyze human emergency behavior and mobility following the Great East Japan Earthquake and the Fukushima nuclear accident. Our empirical analysis of these data shows that human behavior and mobility following a large-scale disaster sometimes correlate with mobility patterns during normal times, and are also strongly influenced by social relationships, disaster intensity, damage level, government-appointed shelters, news reporting, and large population flows. On the basis of these findings, we develop a model of human behavior that takes these factors into account to accurately predict human emergency behavior and mobility following large-scale disasters. The experimental results and validations demonstrate the effectiveness of our behavior model, and suggest that human behavior and movements during disasters may be significantly more predictable than previously thought.
{"title":"Prediction of human emergency behavior and their mobility following large-scale disaster","authors":"Xuan Song, Quanshi Zhang, Y. Sekimoto, R. Shibasaki","doi":"10.1145/2623330.2623628","DOIUrl":"https://doi.org/10.1145/2623330.2623628","url":null,"abstract":"The frequency and intensity of natural disasters has significantly increased over the past decades and this trend is predicted to continue. Facing these possible and unexpected disasters, accurately predicting human emergency behavior and their mobility will become the critical issue for planning effective humanitarian relief, disaster management, and long-term societal reconstruction. In this paper, we build up a large human mobility database (GPS records of 1.6 million users over one year) and several different datasets to capture and analyze human emergency behavior and their mobility following the Great East Japan Earthquake and Fukushima nuclear accident. Based on our empirical analysis through these data, we find that human behavior and their mobility following large-scale disaster sometimes correlate with their mobility patterns during normal times, and are also highly impacted by their social relationship, intensity of disaster, damage level, government appointed shelters, news reporting, large population flow and etc. On the basis of these findings, we develop a model of human behavior that takes into account these factors for accurately predicting human emergency behavior and their mobility following large-scale disaster. The experimental results and validations demonstrate the efficiency of our behavior model, and suggest that human behavior and their movements during disasters may be significantly more predictable than previously thought.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90196617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Web site owners, from small web sites to the largest properties including Amazon, Facebook, Google, LinkedIn, Microsoft, and Yahoo, attempt to improve their web sites, optimizing for criteria ranging from repeat usage and time on site to revenue. Having been involved in running thousands of controlled experiments at Amazon, Booking.com, LinkedIn, and multiple Microsoft properties, we share seven rules of thumb for experimenters, generalized from these experiments and their results. These are principles that we believe have broad applicability in web optimization and analytics beyond controlled experiments, yet they are not provably correct, and in some cases exceptions are known. To support these rules of thumb, we share multiple real examples, most published here for the first time. Some rules of thumb have been stated before, such as 'speed matters,' but we describe the assumptions in the experimental design and share additional experiments that improved our understanding of where speed matters more: certain areas of the web page are more critical. This paper serves two goals. First, it can guide experimenters with rules of thumb that can help them optimize their sites. Second, it provides the KDD community with new research challenges on the applicability, exceptions, and extensions of these rules, one of the goals of KDD's industrial track.
{"title":"Seven rules of thumb for web site experimenters","authors":"Ron Kohavi, Alex Deng, R. Longbotham, Ya Xu","doi":"10.1145/2623330.2623341","DOIUrl":"https://doi.org/10.1145/2623330.2623341","url":null,"abstract":"Web site owners, from small web sites to the largest properties that include Amazon, Facebook, Google, LinkedIn, Microsoft, and Yahoo, attempt to improve their web sites, optimizing for criteria ranging from repeat usage, time on site, to revenue. Having been involved in running thousands of controlled experiments at Amazon, Booking.com, LinkedIn, and multiple Microsoft properties, we share seven rules of thumb for experimenters, which we have generalized from these experiments and their results. These are principles that we believe have broad applicability in web optimization and analytics outside of controlled experiments, yet they are not provably correct, and in some cases exceptions are known. To support these rules of thumb, we share multiple real examples, most being shared in a public paper for the first time. Some rules of thumb have previously been stated, such as 'speed matters,' but we describe the assumptions in the experimental design and share additional experiments that improved our understanding of where speed matters more: certain areas of the web page are more critical. This paper serves two goals. First, it can guide experimenters with rules of thumb that can help them optimize their sites. Second, it provides the KDD community with new research challenges on the applicability, exceptions, and extensions to these, one of the goals for KDD's industrial track.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89750134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale data center networks are complex---comprising several thousand network devices and several hundred thousand links---and form the critical infrastructure upon which all higher-level services depend. Despite the built-in redundancy in data center networks, performance issues and device or link failures can lead to user-perceived service interruptions. Therefore, detecting and localizing user-impacting availability and performance issues in near real time is crucial. Traditionally, both passive and active monitoring approaches have been used for failure localization. However, data from passive monitoring is often too noisy and does not effectively capture silent or gray failures, whereas active monitoring is effective at detecting faults but limited in its ability to isolate the exact fault location, depending on its scale and granularity. Our key idea is to apply statistical data mining techniques to large-scale active monitoring data to produce a ranked list of suspect causes, which we refine with passive monitoring signals. In particular, we compute a failure probability for devices and links in near real time from active monitoring data and look for statistically significant increases in this probability. We also correlate the probabilistic output with failure signals from passive monitoring to increase the confidence of the analysis. We have implemented our approach in the Windows Azure production environment and validated its effectiveness in terms of localization accuracy, precision, and time to localization using known network incidents over the past three months. The correlated ranked list of devices and links is surfaced as a report that network operators use to investigate current issues and identify probable root causes.
{"title":"Scalable near real-time failure localization of data center networks","authors":"H. Herodotou, Bolin Ding, S. Balakrishnan, G. Outhred, Percy Fitter","doi":"10.1145/2623330.2623365","DOIUrl":"https://doi.org/10.1145/2623330.2623365","url":null,"abstract":"Large-scale data center networks are complex---comprising several thousand network devices and several hundred thousand links---and form the critical infrastructure upon which all higher-level services depend on. Despite the built-in redundancy in data center networks, performance issues and device or link failures in the network can lead to user-perceived service interruptions. Therefore, determining and localizing user-impacting availability and performance issues in the network in near real time is crucial. Traditionally, both passive and active monitoring approaches have been used for failure localization. However, data from passive monitoring is often too noisy and does not effectively capture silent or gray failures, whereas active monitoring is potent in detecting faults but limited in its ability to isolate the exact fault location depending on its scale and granularity. Our key idea is to use statistical data mining techniques on large-scale active monitoring data to determine a ranked list of suspect causes, which we refine with passive monitoring signals. In particular, we compute a failure probability for devices and links in near real time using data from active monitoring, and look for statistically significant increases in the failure probability. We also correlate the probabilistic output with other failure signals from passive monitoring to increase the confidence of the probabilistic analysis. We have implemented our approach in the Windows Azure production environment and have validated its effectiveness in terms of localization accuracy, precision, and time to localization using known network incidents over the past three months. The correlated ranked list of devices and links is surfaced as a report that is used by network operators to investigate current issues and identify probable root causes.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89914986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given two homogeneous rating matrices with some overlapping users/items whose mappings are unknown, this paper aims to answer two questions. First, can we identify the unknown mapping between the users and/or items? Second, can we further utilize the identified mappings to improve the quality of recommendation in either domain? Our solution integrates a latent-space matching procedure with a refining process based on prediction optimization to identify the matching. We then design a transfer-based method to improve recommendation performance. Using both synthetic and real data, we conduct extensive experiments covering different real-life scenarios to verify the effectiveness of our models. The code and other materials are available at http://www.csie.ntu.edu.tw/~r00922051/matching/
{"title":"Matching users and items across domains to improve the recommendation quality","authors":"Chung-Yi Li, Shou-de Lin","doi":"10.1145/2623330.2623657","DOIUrl":"https://doi.org/10.1145/2623330.2623657","url":null,"abstract":"Given two homogeneous rating matrices with some overlapped users/items whose mappings are unknown, this paper aims at answering two questions. First, can we identify the unknown mapping between the users and/or items? Second, can we further utilize the identified mappings to improve the quality of recommendation in either domain? Our solution integrates a latent space matching procedure and a refining process based on the optimization of prediction to identify the matching. Then, we further design a transfer-based method to improve the recommendation performance. Using both synthetic and real data, we have done extensive experiments given different real life scenarios to verify the effectiveness of our models. The code and other materials are available at http://www.csie.ntu.edu.tw/~r00922051/matching/","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89963224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over 40% of columns in hundreds of millions of Web tables contain numeric quantities. Tables are a richer source of structured knowledge than free text. We harness Web tables to answer queries whose target is a quantity with natural variation, such as net worth of zuckerburg, battery life of ipad, half life of plutonium, and calories in pizza. Our goal is to respond to such queries with a ranked list of quantity distributions, suitably represented. Apart from the challenges of informal schema and noisy extractions, known since tables were first used for non-quantity information extraction, we face the additional problems of noisy number formats and unit specifications that are often contextual and ambiguous. Early "hardening" of extraction decisions at the table level leads to poor accuracy. Instead, we use a probabilistic context-free grammar (PCFG) based unit extractor on the tables and retain several top-scoring extractions of quantities and numerals. We then inject these into a new collective inference framework that makes global decisions about the relevance of candidate table snippets, the interpretation of the query's target quantity type, the value distributions to be ranked and presented, and the degree of consensus that can be built to support the proposed quantity distributions. Experiments with over 25 million Web tables and 350 diverse queries show robust, large benefits from our quantity catalog, unit extractor, and collective inference.
{"title":"Open-domain quantity queries on web tables: annotation, response, and consensus models","authors":"Sunita Sarawagi, Soumen Chakrabarti","doi":"10.1145/2623330.2623749","DOIUrl":"https://doi.org/10.1145/2623330.2623749","url":null,"abstract":"Over 40% of columns in hundreds of millions of Web tables contain numeric quantities. Tables are a richer source of structured knowledge than free text. We harness Web tables to answer queries whose target is a quantity with natural variation, such as net worth of zuckerburg, battery life of ipad, half life of plutonium, and calories in pizza. Our goal is to respond to such queries with a ranked list of quantity distributions, suitably represented. Apart from the challenges of informal schema and noisy extractions, which have been known since tables were used for non-quantity information extraction, we face additional problems of noisy number formats, as well as unit specifications that are often contextual and ambiguous. Early \"hardening\" of extraction decisions at a table level leads to poor accuracy. Instead, we use a probabilistic context free grammar (PCFG) based unit extractor on the tables, and retain several top-scoring extractions of quantity and numerals. Then we inject these into a new collective inference framework that makes global decisions about the relevance of candidate table snippets, the interpretation of the query's target quantity type, the value distributions to be ranked and presented, and the degree of consensus that can be built to support the proposed quantity distributions. Experiments with over 25 million Web tables and 350 diverse queries show robust, large benefits from our quantity catalog, unit extractor, and collective inference.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87467140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep societal benefits will spring from advances in data availability and in computational procedures for mining insights and inferences from large data sets. I will describe efforts to harness data for making predictions and guiding decisions, touching on work in transportation, healthcare, online services, and interactive systems. I will start with efforts to learn and field predictive models that forecast flows of traffic in greater city regions. Moving from the ground to the air, I will discuss fusing data from aircraft to make inferences about atmospheric conditions and using these results to enhance air transport. I will then focus on experiences with building and fielding predictive models in clinical medicine. I will show how inferences about outcomes and interventions can provide insights and guide decision making. Moving beyond data captured by hospitals, I will discuss the promise of transforming anonymized behavioral data drawn from web services into large-scale sensor networks for public health, including efforts to identify adverse effects of medications and to understand illness in populations. I will conclude by describing how we can use machine learning to leverage the complementarity of human and machine intellect to solve challenging problems in science and society.
{"title":"Data, predictions, and decisions in support of people and society","authors":"E. Horvitz","doi":"10.1145/2623330.2630815","DOIUrl":"https://doi.org/10.1145/2623330.2630815","url":null,"abstract":"Deep societal benefits will spring from advances in data availability and in computational procedures for mining insights and inferences from large data sets. I will describe efforts to harness data for making predictions and guiding decisions, touching on work in transportation, healthcare, online services, and interactive systems. I will start with efforts to learn and field predictive models that forecast flows of traffic in greater city regions. Moving from the ground to the air, I will discuss fusing data from aircraft to make inferences about atmospheric conditions and using these results to enhance air transport. I will then focus on experiences with building and fielding predictive models in clinical medicine. I will show how inferences about outcomes and interventions can provide insights and guide decision making. Moving beyond data captured by hospitals, I will discuss the promise of transforming anonymized behavioral data drawn from web services into large-scale sensor networks for public health, including efforts to identify adverse effects of medications and to understand illness in populations. I will conclude by describing how we can use machine learning to leverage the complementarity of human and machine intellect to solve challenging problems in science and society.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"122 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76148537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional data mining techniques are designed to model a single type of heterogeneity, such as multi-task learning for modeling task heterogeneity and multi-view learning for modeling view heterogeneity. Recently, a variety of real applications have emerged that exhibit dual heterogeneity, namely both task heterogeneity and view heterogeneity. Examples include insider threat detection across multiple organizations, web image classification in different domains, etc. Existing methods for addressing such problems typically assume that multiple tasks are equally related and multiple views are equally consistent, which limits their application in complex settings with varying task relatedness and view consistency. In this paper, we advance state-of-the-art techniques by adaptively modeling task relatedness and view consistency via a nonparametric Bayes model: we model task relatedness using a normal penalty with sparse covariances, and view consistency using a matrix Dirichlet process. Based on this model, we propose the NOBLE algorithm using an efficient Gibbs sampler. Experimental results on multiple real data sets demonstrate the effectiveness of the proposed algorithm.
{"title":"Learning with dual heterogeneity: a nonparametric bayes model","authors":"Hongxia Yang, Jingrui He","doi":"10.1145/2623330.2623727","DOIUrl":"https://doi.org/10.1145/2623330.2623727","url":null,"abstract":"Traditional data mining techniques are designed to model a single type of heterogeneity, such as multi-task learning for modeling task heterogeneity, multi-view learning for modeling view heterogeneity, etc. Recently, a variety of real applications emerged, which exhibit dual heterogeneity, namely both task heterogeneity and view heterogeneity. Examples include insider threat detection across multiple organizations, web image classification in different domains, etc. Existing methods for addressing such problems typically assume that multiple tasks are equally related and multiple views are equally consistent, which limits their application in complex settings with varying task relatedness and view consistency. In this paper, we advance state-of-the-art techniques by adaptively modeling task relatedness and view consistency via a nonparametric Bayes model: we model task relatedness using normal penalty with sparse covariances, and view consistency using matrix Dirichlet process. Based on this model, we propose the NOBLE algorithm using an efficient Gibbs sampler. Experimental results on multiple real data sets demonstrate the effectiveness of the proposed algorithm.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"141 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77518352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The recent blossoming of social network and communication services in both public and corporate settings has generated a staggering amount of network data of all kinds. Unlike the bio-networks and the chemical compound graph data often used in traditional network mining and analysis, the new network data grown out of social applications are characterized by rich attributes, high heterogeneity, enormous sizes, and complex patterns with various semantic meanings, all of which pose significant research challenges to the graph/network mining community. In this tutorial, we aim to examine recent advances in network mining and analysis for social applications, covering a diverse collection of methodologies and applications from the perspectives of event, relationship, collaboration, and network pattern. We present the problem settings, the challenges, the recent research advances, and some future directions for each perspective. Topics include but are not limited to correlation mining, iceberg finding, anomaly detection, relationship discovery, information flow, task routing, and pattern mining.
{"title":"Network mining and analysis for social applications","authors":"Feida Zhu, Huan Sun, Xifeng Yan","doi":"10.1145/2623330.2630810","DOIUrl":"https://doi.org/10.1145/2623330.2630810","url":null,"abstract":"The recent blossom of social network and communication services in both public and corporate settings have generated a staggering amount of network data of all kinds. Unlike the bio-networks and the chemical compound graph data often used in traditional network mining and analysis, the new network data grown out of the social applications are characterized by their rich attributes, high heterogeneity, enormous sizes and complex patterns of various semantic meanings, all of which have posed significant research challenges to the graph/network mining community. In this tutorial, we aim to examine some recent advances in network mining and analysis for social applications, covering a diverse collection of methodologies and applications from the perspectives of event, relationship, collaboration, and network pattern. We would present the problem settings, the challenges, the recent research advances and some future directions for each perspective. Topics include but are not limited to correlation mining, iceberg finding, anomaly detection, relationship discovery, information flow, task routing, and pattern mining.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"99 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86115235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPS technology and new forms of urban geography have changed the paradigm for mobile services, and the abundant availability of GPS traces has enabled new ways of doing taxi business. Indeed, recent efforts have been made on developing mobile recommender systems for taxi drivers using taxi GPS traces. These systems can recommend a sequence of pick-up points with the goal of maximizing the probability of finding a customer within the shortest driving distance. However, in the real world, the income of taxi drivers is strongly correlated with their effective driving hours. In other words, it is more critical for taxi drivers to know the actual driving routes that minimize the driving time before finding a customer. To this end, in this paper, we develop a cost-effective recommender system for taxi drivers. The design goal is to maximize their profits when following the recommended routes for finding passengers. Specifically, we first design a net profit objective function for evaluating the potential profits of driving routes. Then, we develop a graph representation of road networks by mining historical taxi GPS traces and provide a brute-force strategy to generate optimal driving routes for recommendation. However, a critical challenge along this line is the high computational cost of the graph-based approach. Therefore, we develop a novel recursion strategy, based on the special form of the net profit function, for searching optimal candidate routes efficiently. In particular, instead of recommending a sequence of pick-up points and letting the driver decide how to get to those points, our recommender system provides an entire driving route, and drivers can find a customer with the largest potential profit by following the recommendations. This makes our recommender system more practical and profitable than existing recommender systems. Finally, we carry out extensive experiments on a real-world data set collected from the San Francisco Bay area, and the experimental results clearly validate the effectiveness of the proposed recommender system.
{"title":"A cost-effective recommender system for taxi drivers","authors":"Meng Qu, Hengshu Zhu, Junming Liu, Guannan Liu, Hui Xiong","doi":"10.1145/2623330.2623668","DOIUrl":"https://doi.org/10.1145/2623330.2623668","url":null,"abstract":"The GPS technology and new forms of urban geography have changed the paradigm for mobile services. As such, the abundant availability of GPS traces has enabled new ways of doing taxi business. Indeed, recent efforts have been made on developing mobile recommender systems for taxi drivers using Taxi GPS traces. These systems can recommend a sequence of pick-up points for the purpose of maximizing the probability of identifying a customer with the shortest driving distance. However, in the real world, the income of taxi drivers is strongly correlated with the effective driving hours. In other words, it is more critical for taxi drivers to know the actual driving routes to minimize the driving time before finding a customer. To this end, in this paper, we propose to develop a cost-effective recommender system for taxi drivers. The design goal is to maximize their profits when following the recommended routes for finding passengers. Specifically, we first design a net profit objective function for evaluating the potential profits of the driving routes. Then, we develop a graph representation of road networks by mining the historical taxi GPS traces and provide a Brute-Force strategy to generate optimal driving route for recommendation. However, a critical challenge along this line is the high computational cost of the graph based approach. Therefore, we develop a novel recursion strategy based on the special form of the net profit function for searching optimal candidate routes efficiently. Particularly, instead of recommending a sequence of pick-up points and letting the driver decide how to get to those points, our recommender system is capable of providing an entire driving route, and the drivers are able to find a customer for the largest potential profit by following the recommendations. This makes our recommender system more practical and profitable than other existing recommender systems. Finally, we carry out extensive experiments on a real-world data set collected from the San Francisco Bay area and the experimental results clearly validate the effectiveness of the proposed recommender system.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84852259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}