Direct out-of-memory distributed parallel frequent pattern mining
Z. Rong, J. D. Knijf
DOI: https://doi.org/10.1145/2501221.2501229

Frequent itemset mining is a well-studied and important problem in the data mining community. An abundance of mining algorithms exists, each with its own flavor and characteristics, but almost all suffer from two major shortcomings. First, frequent itemset mining algorithms generally perform an exhaustive search over a huge pattern space. Second, most algorithms assume that the input data fits into main memory. The first problem was recently tackled in the work of [2], by directly sampling the required number of patterns from the pattern space. This paper extends the direct sampling approach by casting the algorithm into the MapReduce framework, effectively removing the requirement that the data fit into main memory. The results show that the algorithm scales well to large data sets, while its memory requirements depend solely on the required number of patterns in the output.
{"title":"Direct out-of-memory distributed parallel frequent pattern mining","authors":"Z. Rong, J. D. Knijf","doi":"10.1145/2501221.2501229","DOIUrl":"https://doi.org/10.1145/2501221.2501229","url":null,"abstract":"Frequent itemset mining is a well studied and important problem in the datamining community. An abundance of different mining algorithms exists, all with different flavor and characteristics, but almost all suffer from two major shortcomings. First, in general frequent itemset mining algorithms perform exhaustive search over a huge pattern space. Second, most algorithms assume that the input data fits into main memory. The first problem was recently tackled in the work of [2], by direct sampling the required number of patterns over the pattern space. This paper extends the direct sampling approach by casting the algorithm into the MapReduce framework, effectively ceasing the memory requirements that the data should fit into main memory. The results show that the algorithm scales well for large data sets, while the memory requirements are solely dependent on the required number of patterns in the output.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124404214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CAPRI: a tool for mining complex line patterns in large log data
F. Zulkernine, Patrick Martin, W. Powley, S. Soltani, Serge Mankovskii, Mark Addleman
DOI: https://doi.org/10.1145/2501221.2501228
Log files provide important information for troubleshooting complex systems. However, the structure and contents of the log data and messages vary widely. For automated processing, it is necessary to first understand the layout and the structure of the data, which becomes very challenging when a massive amount of data and messages are reported by different system components in the same log file. Existing approaches apply supervised mining techniques and return frequent patterns only for single line messages. We present CAPRI (type-CAsted Pattern and Rule mIner), which uses a novel pattern mining algorithm to efficiently mine structural line patterns from semi-structured multi-line log messages. It discovers line patterns in a type-casted format; categorizes all data lines; identifies frequent, rare and interesting line patterns, and uses unsupervised learning and incremental mining techniques. It also mines association rules to identify the contextual relationship between two successive line patterns. In addition, CAPRI lists the frequent term and value patterns given the minimum support thresholds. The line and term pattern information can be applied in the next stage to categorize and reformat multi-line data, extract variables from the messages, and discover further correlation among messages for troubleshooting complex systems. To evaluate our approach, we present a comparative study of our tool against some of the existing popular open-source research tools using three different layouts of log data including a complex multi-line log file from the z/OS mainframe system.
{"title":"CAPRI: a tool for mining complex line patterns in large log data","authors":"F. Zulkernine, Patrick Martin, W. Powley, S. Soltani, Serge Mankovskii, Mark Addleman","doi":"10.1145/2501221.2501228","DOIUrl":"https://doi.org/10.1145/2501221.2501228","url":null,"abstract":"Log files provide important information for troubleshooting complex systems. However, the structure and contents of the log data and messages vary widely. For automated processing, it is necessary to first understand the layout and the structure of the data, which becomes very challenging when a massive amount of data and messages are reported by different system components in the same log file. Existing approaches apply supervised mining techniques and return frequent patterns only for single line messages. We present CAPRI (type-CAsted Pattern and Rule mIner), which uses a novel pattern mining algorithm to efficiently mine structural line patterns from semi-structured multi-line log messages. It discovers line patterns in a type-casted format; categorizes all data lines; identifies frequent, rare and interesting line patterns, and uses unsupervised learning and incremental mining techniques. It also mines association rules to identify the contextual relationship between two successive line patterns. In addition, CAPRI lists the frequent term and value patterns given the minimum support thresholds. The line and term pattern information can be applied in the next stage to categorize and reformat multi-line data, extract variables from the messages, and discover further correlation among messages for troubleshooting complex systems. To evaluate our approach, we present a comparative study of our tool against some of the existing popular open-source research tools using three different layouts of log data including a complex multi-line log file from the z/OS mainframe system.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129779915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big & personal: data and models behind netflix recommendations
X. Amatriain
DOI: https://doi.org/10.1145/2501221.2501222

Since the Netflix $1 million Prize, announced in 2006, our company has been known for having personalization at the core of our product. Even at that point in time, the dataset that we released was considered "large", and it spurred innovation in the (Big) Data Mining research field. Our current product offering is focused on instant video streaming, and our data is now many orders of magnitude larger. Not only do we have many more users in many more countries, but we also receive many more streams of data. Besides the ratings, we now also use information such as what our members play, browse, or search.

In this paper, we discuss the different approaches we follow to deal with these large streams of data in order to extract information for personalizing our service. We describe some of the machine learning models used, as well as the architectures that allow us to combine complex offline batch processes with real-time data streams.
{"title":"Big & personal: data and models behind netflix recommendations","authors":"X. Amatriain","doi":"10.1145/2501221.2501222","DOIUrl":"https://doi.org/10.1145/2501221.2501222","url":null,"abstract":"Since the Netflix $1 million Prize, announced in 2006, our company has been known to have personalization at the core of our product. Even at that point in time, the dataset that we released was considered \"large\", and we stirred innovation in the (Big) Data Mining research field. Our current product offering is now focused around instant video streaming, and our data is now many orders of magnitude larger. Not only do we have many more users in many more countries, but we also receive many more streams of data. Besides the ratings, we now also use information such as what our members play, browse, or search.\u0000 In this paper, we will discuss the different approaches we follow to deal with these large streams of data in order to extract information for personalizing our service. We will describe some of the machine learning models used, as well as the architectures that allow us to combine complex offline batch processes with real-time data streams.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"185 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122671789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Forecasting building occupancy using sensor network data
James W. Howard, W. Hoff
DOI: https://doi.org/10.1145/2501221.2501233

Forecasting the occupancy of buildings can lead to significant improvement of smart heating and cooling systems. Using a sensor network of simple passive infrared motion sensors densely placed throughout a building, we perform data mining to forecast occupancy a short time (i.e., up to 60 minutes) into the future. Our approach is to train a set of standard forecasting models on our time series data. Each model then forecasts occupancy at various horizons into the future. We combine these forecasts using a modified Bayesian combined forecasting approach. The method is demonstrated on two large building occupancy datasets, and it shows promising results for forecasting horizons of up to 60 minutes. Because the two datasets have very different occupancy profiles, we compare our algorithms on each dataset to evaluate the performance of the forecasting algorithm under different conditions.
{"title":"Forecasting building occupancy using sensor network data","authors":"James W. Howard, W. Hoff","doi":"10.1145/2501221.2501233","DOIUrl":"https://doi.org/10.1145/2501221.2501233","url":null,"abstract":"Forecasting the occupancy of buildings can lead to significant improvement of smart heating and cooling systems. Using a sensor network of simple passive infrared motion sensors densely placed throughout a building, we perform data mining to forecast occupancy a short time (i.e., up to 60 minutes) into the future. Our approach is to train a set of standard forecasting models to our time series data. Each model then forecasts occupancy a various horizons into the future. We combine these forecasts using a modified Bayesian combined forecasting approach. The method is demonstrated on two large building occupancy datasets, and shows promising results for forecasting horizons of up to 60 minutes. Because the two datasets have such different occupancy profiles, we compare our algorithms on each dataset to evaluate the performance of the forecasting algorithm for the different conditions.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115040130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimating building simulation parameters via Bayesian structure learning
Richard E. Edwards, J. New, L. Parker
DOI: https://doi.org/10.1145/2501221.2501226

Many key building design policies are made using sophisticated computer simulations such as EnergyPlus (E+), the DOE flagship whole-building energy simulation engine. E+ and other sophisticated computer simulations have several major problems. The two main issues are 1) gaps between the simulation model and the actual structure, and 2) limitations of the modeling engine's capabilities. Currently, these problems are addressed by having an engineer manually calibrate simulation parameters to real-world data or by using algorithmic optimization methods to adjust the building parameters. However, some simulation engines, like E+, are computationally expensive, which makes repeatedly evaluating the simulation engine costly. This work explores addressing this issue by automatically discovering the simulation's internal input and output dependencies from ~20 Gigabytes of E+ simulation data; future extensions will use ~200 Terabytes of E+ simulation data. The model is validated by inferring building parameters for E+ simulations with known ground-truth building parameters. Our results indicate that the model accurately represents parameter means, with some deviation from the means, but does not support inferring parameter values that lie on the distribution's tail.
{"title":"Estimating building simulation parameters via Bayesian structure learning","authors":"Richard E. Edwards, J. New, L. Parker","doi":"10.1145/2501221.2501226","DOIUrl":"https://doi.org/10.1145/2501221.2501226","url":null,"abstract":"Many key building design policies are made using sophisticated computer simulations such as EnergyPlus (E+), the DOE flagship whole-building energy simulation engine. E+ and other sophisticated computer simulations have several major problems. The two main issues are 1) gaps between the simulation model and the actual structure, and 2) limitations of the modeling engine's capabilities. Currently, these problems are addressed by having an engineer manually calibrate simulation parameters to real world data or using algorithmic optimization methods to adjust the building parameters. However, some simulations engines, like E+, are computationally expensive, which makes repeatedly evaluating the simulation engine costly. This work explores addressing this issue by automatically discovering the simulation's internal input and output dependencies from ~20 Gigabytes of E+ simulation data, future extensions will use ~200 Terabytes of E+ simulation data. The model is validated by inferring building parameters for E+ simulations with ground truth building parameters. Our results indicate that the model accurately represents parameter means with some deviation from the means, but does not support inferring parameter values that exist on the distribution's tail.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"492 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123409184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An architecture for detecting events in real-time using massive heterogeneous data sources
G. Valkanas, D. Gunopulos, Ioannis Boutsis, V. Kalogeraki
DOI: https://doi.org/10.1145/2501221.2501235
The wealth of information that is readily available nowadays grants researchers and practitioners the ability to develop techniques and applications that monitor and react to all sorts of circumstances: from network congestion to natural catastrophes. Therefore, it is no longer a question of whether this can be done, but how to do it in real time, and if possible proactively. Consequently, it becomes a necessity to develop a platform that aggregates all the necessary information and orchestrates it in the best way possible towards meeting these goals. A main problem that arises in such a setting is the high diversity of the incoming data, obtained from very different sources such as sensors, smart phones, GPS signals, and social networks. The large volume of the incoming data is a gift, because it ensures high quality of the produced output, but also a curse, because greater computational resources are needed. In this paper, we present the architecture of a framework designed to gather, aggregate, and process a wide range of sensory input coming from very different sources. A distinctive characteristic of our framework is the active involvement of citizens. We guide the description of how our framework meets our requirements through two indicative use cases.
{"title":"An architecture for detecting events in real-time using massive heterogeneous data sources","authors":"G. Valkanas, D. Gunopulos, Ioannis Boutsis, V. Kalogeraki","doi":"10.1145/2501221.2501235","DOIUrl":"https://doi.org/10.1145/2501221.2501235","url":null,"abstract":"The wealth of information that is readily available nowadays grants researchers and practitioners the ability to develop techniques and applications that monitor and react to all sorts of circumstances: from network congestions to natural catastrophies. Therefore, it is no longer a question of whether this can be done, but how to do it in real-time, and if possible proactively. Consequently, it becomes a necessity to develop a platform that will aggregate all the necessary information and will orchestrate it in the best way possible, towards meeting these goals. A main problem that arises in such a setting is the high diversity of the incoming data, obtained from very different sources such as sensors, smart phones, GPS signals and social networks. The large volume of the incoming data is a gift that ensures high quality of the produced output, but also a curse, because higher computational resources are needed. In this paper, we present the architecture of a framework designed to gather, aggregate and process a wide range of sensory input coming from very different sources. A distinctive characteristic of our framework is the active involvement of citizens. We guide the description of how our framework meets our requirements through two indicative use cases.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128800873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soft-CsGDT: soft cost-sensitive Gaussian decision tree for cost-sensitive classification of data streams
Ning Guo, Yanhua Yu, Meina Song, Junde Song, Yu Fu
DOI: https://doi.org/10.1145/2501221.2501223
Nowadays, in many real-world scenarios, high-speed data streams usually come with non-uniform misclassification costs and thus call for cost-sensitive classification algorithms for data streams. However, little literature focuses on this issue. On the other hand, existing algorithms for cost-sensitive classification can achieve excellent performance in terms of total misclassification cost, but they always lead to an obvious reduction in accuracy, which greatly restrains their practical application. In this paper, we present an improved folk theorem. Based on the new theorem, an existing accuracy-based classification algorithm can be converted into a soft cost-sensitive one immediately, which allows us to take both accuracy and cost into account. Following the idea of this theorem, the Soft-CsGDT algorithm is proposed to process data streams with non-uniform misclassification costs; it is an extension of GDT. Experimental results on both synthetic and real-world datasets show that, compared with the cost-sensitive algorithm, the accuracy of Soft-CsGDT is significantly improved, while the total misclassification costs are approximately the same.
{"title":"Soft-CsGDT: soft cost-sensitive Gaussian decision tree for cost-sensitive classification of data streams","authors":"Ning Guo, Yanhua Yu, Meina Song, Junde Song, Yu Fu","doi":"10.1145/2501221.2501223","DOIUrl":"https://doi.org/10.1145/2501221.2501223","url":null,"abstract":"Nowadays in many real-world scenarios, high speed data streams are usually with non-uniform misclassification costs and thus call for cost-sensitive classification algorithms of data streams. However, only little literature focuses on this issue. On the other hand, the existing algorithms for cost-sensitive classification can achieve excellent performance in the metric of total misclassification costs, but always lead to obvious reduction of accuracy, which restrains the practical application greatly. In this paper, we present an improved folk theorem. Based on the new theorem, the existing accuracy-based classification algorithm can be converted into soft cost-sensitive one immediately, which allows us to take both accuracy and cost into account. Following the idea of this theorem, the soft-CsGDT algorithm is proposed to process the data streams with non-uniform misclassification costs, which is an expansion of GDT. With both synthetic and real-world datasets, the experimental results show that compared with the cost-sensitive algorithm, the accuracy in our soft-CsGDT is significantly improved, while the total misclassification costs are approximately the same.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"171 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116298004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data-driven study of urban infrastructure to enable city-wide ubiquitous computing
Gautam Thakur, Pan Hui, A. Helmy
DOI: https://doi.org/10.1145/2501221.2501231

Engineering a city-wide ubiquitous computing system requires a comprehensive understanding of urban infrastructure, including physical motorways, vehicular traffic, and human activities. Many world cities were built at different time periods and for different purposes, which resulted in diversified structures and characteristics that have to be carefully considered while designing ubiquitous computing facilities. In this paper, we propose a novel technique to study global urban infrastructure, with the aim of enabling city-wide ubiquitous computing, using a massive data-driven network of planet-scale online web cameras and a location-based online social network service, Foursquare. Our approach examines the infrastructure of six metropolitan regions, covering more than 800 locations, 25 million vehicular mobility records, 220k routes, and two million Foursquare check-ins. We evaluate the spatio-temporal correlation in traffic patterns, examine the structure and connectivity of the regions, and study the impact of human mobility on vehicular traffic to gain insight for enabling city-wide ubiquitous computing.
{"title":"Data-driven study of urban infrastructure to enable city-wide ubiquitous computing","authors":"Gautam Thakur, Pan Hui, A. Helmy","doi":"10.1145/2501221.2501231","DOIUrl":"https://doi.org/10.1145/2501221.2501231","url":null,"abstract":"Engineering a city-wide ubiquitous computing system requires a comprehensive understanding of urban infrastructure including physical motorways, vehicular traffic, and human activities. Many world cities were built at different time periods and with different purposes that resulted in diversified structures and characteristics, which have to be carefully considered while designing ubiquitous computing facilities. In this paper, we propose a novel technique to study global urban infrastructure, with enabling city-wide ubiquitous computing as the aim, using a massive data-driven network of planet-scale online web-cameras and a location-based online social network service, Foursquare. Our approach examines six metropolitan regions' infrastructure that includes more than 800 locations, 25 million vehicular mobility records, 220k routes, and two million Foursquare check-ins. We evaluate the spatio-temporal correlation in traffic patterns, examine the structure and connectivity in regions, and study the impact of human mobility on vehicular traffic to gain insight for enabling city-wide ubiquitous computing.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117280095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TV predictor: personalized program recommendations to be displayed on SmartTVs
Christopher Krauss, L. George, S. Arbanowski
DOI: https://doi.org/10.1145/2501221.2501230

Switching through the variety of available TV channels to find the most acceptable program at the current time can be very time-consuming. Especially at prime time, when many different channels offer quality content, it is hard to find the best-fitting channel.

This paper introduces the TV Predictor, a new application that provides personalized program recommendations without the viewer leaving the lean-back position in front of the TV. Technically, the use of common standards and specifications, such as HbbTV, OIPF and W3C, leverages the convergence of broadband and broadcast media. Hints and details can overlay the broadcast signal, so the user gets predictions in appropriate situations, for instance the most suitable movies playing tonight. Additionally, the TV Predictor Autopilot enables the TV set to automatically change the currently viewed channel. A Second Screen Application mirrors the TV screen or displays additional content on tablet PCs and smartphones.

Based on the customer's viewing behavior and explicitly given ratings, the server-side application predicts what the viewer is going to favor. Different data mining approaches are combined to calculate user preferences: Content-Based Filtering algorithms for similar items, Collaborative Filtering algorithms for rating predictions, Clustering for increasing performance, Association Rules for analyzing item relations, and Support Vector Machines for identifying behavior patterns. A ten-fold cross-validation shows a prediction accuracy of about 80%.

TV-specialized user interfaces, user-generated feedback data, and calculated algorithm results, such as Association Rules, are analyzed to underline the characteristics of such a TV-based application.
{"title":"TV predictor: personalized program recommendations to be displayed on SmartTVs","authors":"Christopher Krauss, L. George, S. Arbanowski","doi":"10.1145/2501221.2501230","DOIUrl":"https://doi.org/10.1145/2501221.2501230","url":null,"abstract":"Switching through the variety of available TV channels to find the most acceptable program at the current time can be very time-consuming. Especially at the prime time when there are lots of different channels offering quality content it is hard to find the best fitting channel.\u0000 This paper introduces the TV Predictor, a new application that allows for obtaining personalized program recommendations without leaving the lean back position in front of the TV. Technically the usage of common Standards and Specifications, such as HbbTV, OIPF and W3C, leverage the convergence of broadband and broadcast media. Hints and details can overlay the broadcasting signal and so the user gets predictions in appropriate situations, for instance the most suitable movies playing tonight. Additionally the TV Predictor Autopilot enables the TV set to automatically change the currently viewed channel. A Second Screen Application mirrors the TV screen or displays additional content on tablet PCs and Smartphones.\u0000 Based on the customers viewing behavior and explicit given ratings the server side application predicts what the viewer is going to favor. Different data mining approaches are combined in order to calculate the users preferences: Content Based Filtering algorithms for similar items, Collaborative Filtering algorithms for rating predictions, Clustering for increasing the performance, Association Rules for analyzing item relations and Support Vector Machines for the identification of behavior patterns. A ten fold cross validation shows an accuracy in prediction of about 80%.\u0000 TV specialized User Interfaces, user generated feedback data and calculated algorithm results, such as Association Rules, are analyzed to underline the characteristics of such a TV based application.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129262752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pushing constraints into data streams
Andreia Silva, C. Antunes
DOI: https://doi.org/10.1145/2501221.2501232

One important challenge in data mining is the ability to deal with complex, voluminous, and dynamic data. Indeed, due to the great advances in technology, in many real-world applications data appear in the form of continuous data streams, as opposed to traditional static datasets. Several techniques have been proposed to explore data streams, in particular for the discovery of frequent co-occurrences in data. However, one common criticism of frequent pattern mining is that it generates a huge number of patterns, independent of user expertise, making the results very hard to analyze and use. These bottlenecks are even more evident when dealing with data streams, since new data are continuously and endlessly arriving and many intermediate results must be kept in memory. The use of constraints to filter the results is the most common approach to focus the discovery on what is really interesting. In this sense, there is a need to integrate data stream mining with constrained mining. In this work we describe a set of strategies for pushing constraints into data stream mining, through the use of a pattern tree structure that captures a summary of the current possible patterns. We also propose an algorithm that discovers patterns in data streams that satisfy any user-defined constraint.
{"title":"Pushing constraints into data streams","authors":"Andreia Silva, C. Antunes","doi":"10.1145/2501221.2501232","DOIUrl":"https://doi.org/10.1145/2501221.2501232","url":null,"abstract":"One important challenge in data mining is the ability to deal with complex, voluminous and dynamic data. Indeed, due to the great advances in technology, in many real world applications data appear in the form of continuous data streams, as opposed to traditional static datasets. Several techniques have been proposed to explore data streams, in particular for the discovery of frequent co-occurrences in data. However, one of the common criticisms pointed out to frequent pattern mining is the fact that it generates a huge number of patterns, independent of user expertise, making it very hard to analyze and use the results. These bottlenecks are even more evident when dealing with data streams, since new data are continuously and endlessly arriving, and many intermediate results must be kept in memory. The use of constraints to filter the results is the most common and used approach to focus the discovery on what is really interesting. In this sense, there is a need for the integration of data stream mining with constrained mining. In this work we describe a set of strategies for pushing constraints into data stream mining, through the use of a pattern tree structure that captures a summary of the current possible patterns. We also propose an algorithm that discovers patterns in data streams that satisfy any user defined constraint.","PeriodicalId":441216,"journal":{"name":"BigMine '13","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123700337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}