The effective analysis of social networks and graph-structured data is often limited by the privacy concerns of individuals whose data make up these networks. Differential privacy offers individuals a rigorous and appealing guarantee of privacy. But while differentially private algorithms for computing basic graph properties have been proposed, most graph modeling tasks common in the data mining community cannot yet be carried out privately. In this work we propose algorithms for privately estimating the parameters of exponential random graph models (ERGMs). We break the estimation problem into two steps: computing private sufficient statistics, then using them to estimate the model parameters. We consider specific alternating statistics that are in common use for ERGMs and describe a method for estimating them privately by adding noise proportional to a high-confidence bound on their local sensitivity. In addition, we propose an estimation algorithm that accounts for the noise distribution of the private statistics and offers better accuracy than performing standard parameter estimation on the private statistics.
{"title":"Exponential random graph estimation under differential privacy","authors":"Wentian Lu, G. Miklau","doi":"10.1145/2623330.2623683","DOIUrl":"https://doi.org/10.1145/2623330.2623683","url":null,"abstract":"The effective analysis of social networks and graph-structured data is often limited by the privacy concerns of individuals whose data make up these networks. Differential privacy offers individuals a rigorous and appealing guarantee of privacy. But while differentially private algorithms for computing basic graph properties have been proposed, most graph modeling tasks common in the data mining community cannot yet be carried out privately. In this work we propose algorithms for privately estimating the parameters of exponential random graph models (ERGMs). We break the estimation problem into two steps: computing private sufficient statistics, then using them to estimate the model parameters. We consider specific alternating statistics that are in common use for ERGM models and describe a method for estimating them privately by adding noise proportional to a high-confidence bound on their local sensitivity. In addition, we propose an estimation algorithm that considers the noise distribution of the private statistics and offers better accuracy than performing standard parameter estimation using the private statistics.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"49 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91454742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Josif Grabocka, Nicolas Schilling, Martin Wistuba, L. Schmidt-Thieme
Shapelets are discriminative sub-sequences of time series that best predict the target variable. For this reason, shapelet discovery has recently attracted considerable interest within the time-series research community. Currently, shapelets are found by evaluating the prediction quality of numerous candidates extracted from segments of the series. In contrast to the state of the art, this paper proposes a novel perspective: learning shapelets directly. We propose a new mathematical formalization of the task via a classification objective function and apply a tailored stochastic gradient learning algorithm. The proposed method learns near-optimal shapelets directly, without the need to evaluate a large pool of candidates. Furthermore, our method can learn the true top-K shapelets by capturing their interaction. Extensive experimentation demonstrates statistically significant improvements in terms of wins and ranks against 13 baselines over 28 time-series datasets.
{"title":"Learning time-series shapelets","authors":"Josif Grabocka, Nicolas Schilling, Martin Wistuba, L. Schmidt-Thieme","doi":"10.1145/2623330.2623613","DOIUrl":"https://doi.org/10.1145/2623330.2623613","url":null,"abstract":"Shapelets are discriminative sub-sequences of time series that best predict the target variable. For this reason, shapelet discovery has recently attracted considerable interest within the time-series research community. Currently shapelets are found by evaluating the prediction qualities of numerous candidates extracted from the series segments. In contrast to the state-of-the-art, this paper proposes a novel perspective in terms of learning shapelets. A new mathematical formalization of the task via a classification objective function is proposed and a tailored stochastic gradient learning algorithm is applied. The proposed method enables learning near-to-optimal shapelets directly without the need to try out lots of candidates. Furthermore, our method can learn true top-K shapelets by capturing their interaction. Extensive experimentation demonstrates statistically significant improvement in terms of wins and ranks against 13 baselines over 28 time-series datasets.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87227391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deepak Vasisht, Andreas C. Damianou, M. Varma, Ashish Kapoor
We study the problem of active learning for multilabel classification. We focus on the real-world scenario where the average number of positive (relevant) labels per data point is small, leading to positive label sparsity. Carrying out mutual-information-based near-optimal active learning in this setting is challenging, since the computational complexity involved is exponential in the total number of labels. We propose a novel inference algorithm for the sparse Bayesian multilabel model of [17]. The benefit of this alternate inference scheme is that it enables a natural approximation of the mutual information objective. We prove that the approximation leads to a solution identical to that of the exact optimization problem, but at a fraction of the optimization cost. This allows us to carry out efficient, non-myopic, and near-optimal active learning for sparse multilabel classification. Extensive experiments demonstrate the effectiveness of the method.
{"title":"Active learning for sparse bayesian multilabel classification","authors":"Deepak Vasisht, Andreas C. Damianou, M. Varma, Ashish Kapoor","doi":"10.1145/2623330.2623759","DOIUrl":"https://doi.org/10.1145/2623330.2623759","url":null,"abstract":"We study the problem of active learning for multilabel classification. We focus on the real-world scenario where the average number of positive (relevant) labels per data point is small leading to positive label sparsity. Carrying out mutual information based near-optimal active learning in this setting is a challenging task since the computational complexity involved is exponential in the total number of labels. We propose a novel inference algorithm for the sparse Bayesian multilabel model of [17]. The benefit of this alternate inference scheme is that it enables a natural approximation of the mutual information objective. We prove that the approximation leads to an identical solution to the exact optimization problem but at a fraction of the optimization cost. This allows us to carry out efficient, non-myopic, and near-optimal active learning for sparse multilabel classification. Extensive experiments reveal the effectiveness of the method.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90112208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distance metric learning (DML) aims to learn a distance metric that improves over the Euclidean distance. It has been successfully applied to various tasks, e.g., classification, clustering, and information retrieval. Many DML algorithms suffer from over-fitting because of the large number of parameters to be determined. In this paper, we exploit the dropout technique, which has been successfully applied in deep learning to alleviate over-fitting, for DML. Unlike previous studies that apply dropout only to the training data, we apply dropout to both the learned metric and the training data. We show that applying dropout to DML is essentially equivalent to matrix-norm-based regularization. Compared with the standard regularization scheme in DML, dropout is advantageous in that it simulates structured regularizers, which have shown consistently better performance than non-structured regularizers. We verify, both empirically and theoretically, that dropout is effective in regularizing the learned metric to avoid over-fitting. Finally, we examine wrapping the dropout technique around state-of-the-art DML methods and observe that it can significantly improve the performance of the original methods.
{"title":"Distance metric learning using dropout: a structured regularization approach","authors":"Qi Qian, Juhua Hu, Rong Jin, J. Pei, Shenghuo Zhu","doi":"10.1145/2623330.2623678","DOIUrl":"https://doi.org/10.1145/2623330.2623678","url":null,"abstract":"Distance metric learning (DML) aims to learn a distance metric better than Euclidean distance. It has been successfully applied to various tasks, e.g., classification, clustering and information retrieval. Many DML algorithms suffer from the over-fitting problem because of a large number of parameters to be determined in DML. In this paper, we exploit the dropout technique, which has been successfully applied in deep learning to alleviate the over-fitting problem, for DML. Different from the previous studies that only apply dropout to training data, we apply dropout to both the learned metrics and the training data. We illustrate that application of dropout to DML is essentially equivalent to matrix norm based regularization. Compared with the standard regularization scheme in DML, dropout is advantageous in simulating the structured regularizers which have shown consistently better performance than non structured regularizers. We verify, both empirically and theoretically, that dropout is effective in regulating the learned metric to avoid the over-fitting problem. Last, we examine the idea of wrapping the dropout technique in the state-of-art DML methods and observe that the dropout technique can significantly improve the performance of the original DML methods.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87173813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingbo Shang, Yu Zheng, Wenzhu Tong, Eric Chang, Yong Yu
This paper presents a method that instantly infers the gas consumption and pollution emissions of vehicles traveling on a city's road network in the current time slot, using GPS trajectories from a sample of vehicles (e.g., taxicabs). This knowledge can be used to suggest cost-efficient driving routes and to identify road segments where significant amounts of gas are wasted. Instant estimation of vehicle emissions can enable pollution alerts and, in the long run, help diagnose the root causes of air pollution. In our method, we first compute the travel speed of each road segment using recently received GPS trajectories. As many road segments are not traversed by any trajectory (i.e., data sparsity), we propose a Travel Speed Estimation (TSE) model based on a context-aware matrix factorization approach. TSE leverages features learned from other data sources, e.g., map data and historical trajectories, to deal with the data sparsity problem. We then propose a Traffic Volume Inference (TVI) model to infer the number of vehicles passing each road segment per minute. TVI is an unsupervised Bayesian network that incorporates multiple factors, such as travel speed, weather conditions, and the geographical features of a road. Given the travel speed and traffic volume of a road segment, gas consumption and emissions can be calculated based on existing environmental theories. We evaluate our method through extensive experiments using GPS trajectories generated by over 32,000 taxis in Beijing over a period of two months. The results demonstrate the advantages of our method over baselines, validate the contribution of its components, and yield interesting discoveries for the benefit of society.
{"title":"Inferring gas consumption and pollution emission of vehicles throughout a city","authors":"Jingbo Shang, Yu Zheng, Wenzhu Tong, Eric Chang, Yong Yu","doi":"10.1145/2623330.2623653","DOIUrl":"https://doi.org/10.1145/2623330.2623653","url":null,"abstract":"This paper instantly infers the gas consumption and pollution emission of vehicles traveling on a city's road network in a current time slot, using GPS trajectories from a sample of vehicles (e.g., taxicabs). The knowledge can be used to suggest cost-efficient driving routes as well as identifying road segments where gas has been wasted significantly. The instant estimation of the emissions from vehicles can enable pollution alerts and help diagnose the root cause of air pollution in the long run. In our method, we first compute the travel speed of each road segment using the GPS trajectories received recently. As many road segments are not traversed by trajectories (i.e., data sparsity), we propose a Travel Speed Estimation (TSE) model based on a context-aware matrix factorization approach. TSE leverages features learned from other data sources, e.g., map data and historical trajectories, to deal with the data sparsity problem. We then propose a Traffic Volume Inference (TVI) model to infer the number of vehicles passing each road segment per minute. TVI is an unsupervised Bayesian Network that incorporates multiple factors, such as travel speed, weather conditions and geographical features of a road. Given the travel speed and traffic volume of a road segment, gas consumption and emissions can be calculated based on existing environmental theories. We evaluate our method based on extensive experiments using GPS trajectories generated by over 32,000 taxis in Beijing over a period of two months. The results demonstrate the advantages of our method over baselines, validating the contribution of its components and finding interesting discoveries for the benefit of society.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75219906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid growth of Web 2.0, a variety of content sharing services, such as Flickr, YouTube, Blogger, and TripAdvisor, have become extremely popular over the last decade. On these websites, users have created and shared with one another various kinds of resources, such as photos, videos, and travel blogs. This user-generated content varies greatly in quality, which calls for a principled method to identify, from a massive number of contributors, the set of authorities who create high-quality resources. Since most previous studies infer only a user's global authoritativeness, they cannot differentiate authoritativeness across different aspects of life (topics). In this paper, we propose a novel model of Topic-specific Authority Analysis (TAA), which addresses the limitations of previous approaches, to identify authorities specific to given query topic(s) on a content sharing service. The model jointly leverages usage data collected from the sharing log and the favorite log. The parameters of TAA are learned from a constructed training dataset, for which a novel logistic likelihood function is specifically designed. To perform Bayesian inference for TAA with the new logistic likelihood, we extend typical Gibbs sampling by introducing auxiliary variables. Thorough experiments with two real-world datasets demonstrate the effectiveness of TAA in topic-specific authority identification as well as the generalizability of the TAA generative model.
{"title":"Who are experts specializing in landscape photography?: analyzing topic-specific authority on content sharing services","authors":"Bin Bi, B. Kao, Chang Wan, Junghoo Cho","doi":"10.1145/2623330.2623752","DOIUrl":"https://doi.org/10.1145/2623330.2623752","url":null,"abstract":"With the rapid growth of Web 2.0, a variety of content sharing services, such as Flickr, YouTube, Blogger, and TripAdvisor etc, have become extremely popular over the last decade. On these websites, users have created and shared with each other various kinds of resources, such as photos, video, and travel blogs. The sheer amount of user-generated content varies greatly in quality, which calls for a principled method to identify a set of authorities, who created high-quality resources, from a massive number of contributors of content. Since most previous studies only infer global authoritativeness of a user, there is no way to differentiate the authoritativeness in different aspects of life (topics). In this paper, we propose a novel model of Topic-specific Authority Analysis (TAA), which addresses the limitations of the previous approaches, to identify authorities specific to given query topic(s) on a content sharing service. This model jointly leverages the usage data collected from the sharing log and the favorite log. The parameters in TAA are learned from a constructed training dataset, for which a novel logistic likelihood function is specifically designed. To perform Bayesian inference for TAA with the new logistic likelihood, we extend typical Gibbs sampling by introducing auxiliary variables. Thorough experiments with two real-world datasets demonstrate the effectiveness of TAA in topic-specific authority identification as well as the generalizability of the TAA generative model.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75781498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data stream mining has gained growing attention due to its wide range of emerging applications, such as target marketing, email filtering, and network intrusion detection. In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion. Instead of learning a single model on a sliding window or relying on ensemble learning, SyncStream captures evolving concepts by dynamically maintaining a set of prototypes in a new data structure called the P-tree. The prototypes are obtained by error-driven representativeness learning and synchronization-inspired constrained clustering. To identify abrupt concept drift in data streams, PCA- and statistics-based heuristic approaches are employed. SyncStream has several attractive benefits: (a) it is capable of dynamically modeling evolving concepts from even a small set of prototypes and is robust against noisy examples; (b) owing to synchronization-based constrained clustering and the P-tree, it supports efficient and effective data representation and maintenance; (c) gradual and abrupt concept drift can be effectively detected. Empirical results show that our method achieves good predictive performance compared to state-of-the-art algorithms and requires much less time than another instance-based stream mining algorithm.
{"title":"Prototype-based learning on concept-drifting data streams","authors":"Junming Shao, Zahra Ahmadi, Stefan Kramer","doi":"10.1145/2623330.2623609","DOIUrl":"https://doi.org/10.1145/2623330.2623609","url":null,"abstract":"Data stream mining has gained growing attentions due to its wide emerging applications such as target marketing, email filtering and network intrusion detection. In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion. Instead of learning a single model on a sliding window or ensemble learning, SyncStream captures evolving concepts by dynamically maintaining a set of prototypes in a new data structure called the P-tree. The prototypes are obtained by error-driven representativeness learning and synchronization-inspired constrained clustering. To identify abrupt concept drift in data streams, PCA and statistics based heuristic approaches are employed. SyncStream has several attractive benefits: (a) It is capable of dynamically modeling evolving concepts from even a small set of prototypes and is robust against noisy examples. (b) Owing to synchronization-based constrained clustering and the P-Tree, it supports an efficient and effective data representation and maintenance. (c) Gradual and abrupt concept drift can be effectively detected. Empirical results shows that our method achieves good predictive performance compared to state-of-the-art algorithms and that it requires much less time than another instance-based stream mining algorithm.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"173 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74773761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Lu, Stratis Ioannidis, Smriti Bhagat, L. Lakshmanan
People's interests are dynamically evolving, often affected by external factors such as trends promoted by the media or adopted by their friends. In this work, we model interest evolution through dynamic interest cascades: we consider a scenario where a user's interests may be affected by (a) the interests of other users in her social circle, as well as (b) suggestions she receives from a recommender system. In the latter case, we model user reactions through either attraction or aversion towards past suggestions. We study this interest evolution process, and the utility accrued by recommendations, as a function of the system's recommendation strategy. We show that, in steady state, the optimal strategy can be computed as the solution of a semi-definite program (SDP). Using datasets of user ratings, we provide evidence for the existence of aversion and attraction in real-life data, and show that our optimal strategy can lead to significantly improved recommendations over systems that ignore aversion and attraction.
{"title":"Optimal recommendations under attraction, aversion, and social influence","authors":"Wei Lu, Stratis Ioannidis, Smriti Bhagat, L. Lakshmanan","doi":"10.1145/2623330.2623744","DOIUrl":"https://doi.org/10.1145/2623330.2623744","url":null,"abstract":"People's interests are dynamically evolving, often affected by external factors such as trends promoted by the media or adopted by their friends. In this work, we model interest evolution through dynamic interest cascades: we consider a scenario where a user's interests may be affected by (a) the interests of other users in her social circle, as well as (b) suggestions she receives from a recommender system. In the latter case, we model user reactions through either attraction or aversion towards past suggestions. We study this interest evolution process, and the utility accrued by recommendations, as a function of the system's recommendation strategy. We show that, in steady state, the optimal strategy can be computed as the solution of a semi-definite program (SDP). Using datasets of user ratings, we provide evidence for the existence of aversion and attraction in real-life data, and show that our optimal strategy can lead to significantly improved recommendations over systems that ignore aversion and attraction.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74349118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spreadsheets contain valuable data on many topics. However, they are difficult to integrate with other data sources. Converting spreadsheet data to the relational model would allow data analysts to use relational integration tools. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on an undirected graphical model, our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from a spreadsheet's graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair phase identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. Our experiments on two real-world datasets show that a human can obtain an accurate extraction with just 31% of the manual operations required by a standard classification-based technique.
{"title":"Integrating spreadsheet data via accurate and low-effort extraction","authors":"Zhe Chen, Michael J. Cafarella","doi":"10.1145/2623330.2623617","DOIUrl":"https://doi.org/10.1145/2623330.2623617","url":null,"abstract":"Spreadsheets contain valuable data on many topics. However, spreadsheets are difficult to integrate with other data sources. Converting spreadsheet data to the relational model would allow data analysts to use relational integration tools. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on an undirected graphical model, our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from spreadsheets' graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. Our experiments show that a human can obtain the accurate extraction with just 31% of the manual operations required by a standard classification based technique on two real-world datasets.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74594063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Derek Lin, R. Raghu, V. Ramamurthy, Jin Yu, R. Radhakrishnan, Joseph Fernandez
Large enterprise IT (Information Technology) infrastructure components generate large volumes of alerts and incident tickets. These are screened manually, and it is otherwise difficult to extract information from them automatically to gain the insights needed to improve operational efficiency. We propose a framework that clusters alerts and incident tickets based on the text they contain, using unsupervised machine learning. This would be a step towards eliminating manual classification of alerts and incidents, which is labor-intensive and costly. Our framework handles the semi-structured text in alerts generated by IT infrastructure components such as storage devices, network devices, and servers, as well as the unstructured text in incident tickets created manually by operations support personnel. After text pre-processing and the application of appropriate distance metrics, we apply different graph-theoretic approaches to cluster the alerts and incident tickets, based on their semi-structured and unstructured text respectively. For automated interpretation and readability of the semi-structured text clusters, we propose a cluster visualization method that preserves the structure and human-readability of the text, in contrast to traditional word clouds, where the text structure is not preserved; for unstructured text clusters, we define simple cluster prototypes for easy interpretation. This framework for clustering and visualization will enable enterprises to prioritize issues in their IT infrastructure and improve the reliability and availability of their services.
{"title":"Unveiling clusters of events for alert and incident management in large-scale enterprise it","authors":"Derek Lin, R. Raghu, V. Ramamurthy, Jin Yu, R. Radhakrishnan, Joseph Fernandez","doi":"10.1145/2623330.2623360","DOIUrl":"https://doi.org/10.1145/2623330.2623360","url":null,"abstract":"Large enterprise IT (Information Technology) infrastructure components generate large volumes of alerts and incident tickets. These are manually screened, but it is otherwise difficult to extract information automatically from them to gain insights in order to improve operational efficiency. We propose a framework to cluster alerts and incident tickets based on the text in them, using unsupervised machine learning. This would be a step towards eliminating manual classification of the alerts and incidents, which is very labor intense and costly. Our framework can handle the semi-structured text in alerts generated by IT infrastructure components such as storage devices, network devices, servers etc., as well as the unstructured text in incident tickets created manually by operations support personnel. After text pre-processing and application of appropriate distance metrics, we apply different graph-theoretic approaches to cluster the alerts and incident tickets, based on their semi-structured and unstructured text respectively. For automated interpretation and read-ability on semi-structured text clusters, we propose a method to visualize clusters that preserves the structure and human-readability of the text data as compared to traditional word clouds where the text structure is not preserved; for unstructured text clusters, we find a simple way to define prototypes of clusters for easy interpretation. This framework for clustering and visualization will enable enterprises to prioritize the issues in their IT infrastructure and improve the reliability and availability of their services.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73818036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}