
Latest publications from BigMine '13

Direct out-of-memory distributed parallel frequent pattern mining
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501229
Z. Rong, J. D. Knijf
Frequent itemset mining is a well-studied and important problem in the data mining community. An abundance of different mining algorithms exists, each with its own flavor and characteristics, but almost all suffer from two major shortcomings. First, frequent itemset mining algorithms generally perform exhaustive search over a huge pattern space. Second, most algorithms assume that the input data fits into main memory. The first problem was recently tackled in the work of [2] by directly sampling the required number of patterns from the pattern space. This paper extends the direct sampling approach by casting the algorithm into the MapReduce framework, effectively removing the requirement that the data fit into main memory. The results show that the algorithm scales well for large data sets, while the memory requirements depend solely on the required number of patterns in the output.
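As a concrete illustration of the direct-sampling idea, the following is a minimal single-machine sketch (an assumption-laden illustration, not the authors' MapReduce implementation): drawing a transaction with probability proportional to 2^|t| and then taking a uniformly random subset of it yields a pattern sampled proportionally to its frequency, with memory that depends only on the number of requested patterns. All function and variable names are illustrative.

```python
import random

def sample_patterns(transactions, k, rng=random):
    """Directly sample k itemsets, each with probability proportional
    to its frequency, without enumerating the pattern space."""
    # Step 1: each transaction t supports 2^|t| itemsets, so draw a
    # transaction with probability proportional to that weight.
    weights = [2 ** len(t) for t in transactions]
    total = float(sum(weights))
    patterns = []
    for _ in range(k):
        r = rng.uniform(0, total)
        acc = 0.0
        for t, w in zip(transactions, weights):
            acc += w
            if r <= acc:
                # Step 2: a uniformly random subset of the chosen
                # transaction is a frequency-proportional pattern.
                patterns.append(frozenset(i for i in t if rng.random() < 0.5))
                break
    return patterns

txns = [frozenset("abc"), frozenset("ab"), frozenset("bcd")]
print(sample_patterns(txns, 5))
```

Memory use here is O(k) for the output plus one pass over the data per sample, which is the property the abstract highlights.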
Citations: 3
CAPRI: a tool for mining complex line patterns in large log data
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501228
F. Zulkernine, Patrick Martin, W. Powley, S. Soltani, Serge Mankovskii, Mark Addleman
Log files provide important information for troubleshooting complex systems. However, the structure and contents of the log data and messages vary widely. For automated processing, it is necessary to first understand the layout and the structure of the data, which becomes very challenging when a massive amount of data and messages are reported by different system components in the same log file. Existing approaches apply supervised mining techniques and return frequent patterns only for single-line messages. We present CAPRI (type-CAsted Pattern and Rule mIner), which uses a novel pattern mining algorithm to efficiently mine structural line patterns from semi-structured multi-line log messages. It discovers line patterns in a type-casted format; categorizes all data lines; identifies frequent, rare and interesting line patterns; and uses unsupervised learning and incremental mining techniques. It also mines association rules to identify the contextual relationship between two successive line patterns. In addition, CAPRI lists the frequent term and value patterns given the minimum support thresholds. The line and term pattern information can be applied in the next stage to categorize and reformat multi-line data, extract variables from the messages, and discover further correlations among messages for troubleshooting complex systems. To evaluate our approach, we present a comparative study of our tool against some existing popular open-source research tools, using three different layouts of log data including a complex multi-line log file from the z/OS mainframe system.
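To make the notion of a type-casted line pattern concrete, here is a minimal hedged sketch (the token classes and names are illustrative assumptions, not CAPRI's actual implementation): variable tokens are abstracted into type placeholders so that structurally identical log lines collapse into one line pattern that can be counted.

```python
import re
from collections import Counter

def type_cast(line):
    """Abstract a raw log line into a type-casted line pattern by
    replacing variable tokens with type placeholders."""
    out = []
    for tok in line.split():
        if re.fullmatch(r"\d+", tok):
            out.append("<NUM>")
        elif re.fullmatch(r"0x[0-9a-fA-F]+", tok):
            out.append("<HEX>")
        elif re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", tok):
            out.append("<IP>")
        else:
            out.append(tok)
    return " ".join(out)

log = [
    "job 42 started on 10.0.0.5",
    "job 43 started on 10.0.0.6",
    "disk error at 0xdeadbeef",
]
# Count how often each type-casted line pattern occurs.
patterns = Counter(type_cast(l) for l in log)
print(patterns.most_common(1))  # → [('job <NUM> started on <IP>', 2)]
```

Frequent patterns (here, the job-start line) would then feed the categorization and rule-mining stages the abstract describes.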
Citations: 6
Big & personal: data and models behind netflix recommendations
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501222
X. Amatriain
Since the Netflix $1 million Prize, announced in 2006, our company has been known to have personalization at the core of our product. Even at that point in time, the dataset that we released was considered "large", and we stirred innovation in the (Big) Data Mining research field. Our current product offering is now focused around instant video streaming, and our data is now many orders of magnitude larger. Not only do we have many more users in many more countries, but we also receive many more streams of data. Besides the ratings, we now also use information such as what our members play, browse, or search. In this paper, we will discuss the different approaches we follow to deal with these large streams of data in order to extract information for personalizing our service. We will describe some of the machine learning models used, as well as the architectures that allow us to combine complex offline batch processes with real-time data streams.
Citations: 84
Forecasting building occupancy using sensor network data
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501233
James W. Howard, W. Hoff
Forecasting the occupancy of buildings can lead to significant improvements in smart heating and cooling systems. Using a sensor network of simple passive infrared motion sensors densely placed throughout a building, we perform data mining to forecast occupancy a short time (i.e., up to 60 minutes) into the future. Our approach is to train a set of standard forecasting models on our time series data. Each model then forecasts occupancy at various horizons into the future. We combine these forecasts using a modified Bayesian combined forecasting approach. The method is demonstrated on two large building occupancy datasets, and shows promising results for forecasting horizons of up to 60 minutes. Because the two datasets have such different occupancy profiles, we compare our algorithms on each dataset to evaluate the performance of the forecasting algorithm under the different conditions.
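A minimal sketch of the general idea behind Bayesian combined forecasting (illustrative only; the paper uses a modified variant, and the Gaussian error model and names here are assumptions): each model's forecast is weighted by a posterior weight derived from the likelihood of its recent prediction errors.

```python
import math

def bayesian_combine(forecasts, errors, sigma=1.0):
    """Combine one-step forecasts from several models, weighting each
    model by the Gaussian likelihood of its recent prediction errors."""
    # Unnormalized posterior weight per model: models with small recent
    # errors get exponentially more weight.
    weights = [math.exp(-sum(e * e for e in errs) / (2 * sigma ** 2))
               for errs in errors]
    z = sum(weights)
    weights = [w / z for w in weights]
    # The combined forecast is the weight-averaged model forecast.
    return sum(w * f for w, f in zip(weights, forecasts))

# Two models predict occupancy 30 minutes ahead; model 0 has been
# far more accurate recently, so the combination stays close to it.
print(bayesian_combine([12.0, 20.0], [[0.1, -0.2], [2.0, 1.5]]))
```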
Citations: 22
Estimating building simulation parameters via Bayesian structure learning
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501226
Richard E. Edwards, J. New, L. Parker
Many key building design policies are made using sophisticated computer simulations such as EnergyPlus (E+), the DOE flagship whole-building energy simulation engine. E+ and other sophisticated computer simulations have several major problems. The two main issues are 1) gaps between the simulation model and the actual structure, and 2) limitations of the modeling engine's capabilities. Currently, these problems are addressed by having an engineer manually calibrate simulation parameters to real world data or by using algorithmic optimization methods to adjust the building parameters. However, some simulation engines, like E+, are computationally expensive, which makes repeatedly evaluating the simulation engine costly. This work explores addressing this issue by automatically discovering the simulation's internal input and output dependencies from ~20 Gigabytes of E+ simulation data; future extensions will use ~200 Terabytes of E+ simulation data. The model is validated by inferring building parameters for E+ simulations with ground truth building parameters.
Citations: 2
An architecture for detecting events in real-time using massive heterogeneous data sources
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501235
G. Valkanas, D. Gunopulos, Ioannis Boutsis, V. Kalogeraki
The wealth of information that is readily available nowadays grants researchers and practitioners the ability to develop techniques and applications that monitor and react to all sorts of circumstances: from network congestions to natural catastrophes. Therefore, it is no longer a question of whether this can be done, but how to do it in real-time, and if possible proactively. Consequently, it becomes a necessity to develop a platform that will aggregate all the necessary information and will orchestrate it in the best way possible, towards meeting these goals. A main problem that arises in such a setting is the high diversity of the incoming data, obtained from very different sources such as sensors, smart phones, GPS signals and social networks. The large volume of the incoming data is a gift that ensures high quality of the produced output, but also a curse, because higher computational resources are needed. In this paper, we present the architecture of a framework designed to gather, aggregate and process a wide range of sensory input coming from very different sources. A distinctive characteristic of our framework is the active involvement of citizens. We guide the description of how our framework meets our requirements through two indicative use cases.
Citations: 9
Soft-CsGDT: soft cost-sensitive Gaussian decision tree for cost-sensitive classification of data streams
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501223
Ning Guo, Yanhua Yu, Meina Song, Junde Song, Yu Fu
Nowadays, in many real-world scenarios, high-speed data streams usually come with non-uniform misclassification costs and thus call for cost-sensitive classification algorithms for data streams. However, little literature focuses on this issue. On the other hand, existing algorithms for cost-sensitive classification can achieve excellent performance on the metric of total misclassification cost, but always lead to an obvious reduction in accuracy, which greatly restrains their practical application. In this paper, we present an improved folk theorem. Based on the new theorem, an existing accuracy-based classification algorithm can be converted into a soft cost-sensitive one immediately, which allows us to take both accuracy and cost into account. Following the idea of this theorem, the soft-CsGDT algorithm is proposed to process data streams with non-uniform misclassification costs, which is an expansion of GDT. With both synthetic and real-world datasets, the experimental results show that, compared with the cost-sensitive algorithm, the accuracy of our soft-CsGDT is significantly improved, while the total misclassification costs are approximately the same.
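For intuition on how an accuracy-oriented output can be turned into a cost-sensitive decision, here is the classic Bayes-risk threshold rule (a general textbook construction, not the paper's theorem-based soft conversion): predict the class with the lower expected misclassification cost.

```python
def cost_sensitive_predict(p_pos, cost_fp, cost_fn):
    """Bayes-optimal cost-sensitive decision from a probability estimate:
    predict the class with the lower expected misclassification cost."""
    expected_cost_if_neg = p_pos * cost_fn        # miss a true positive
    expected_cost_if_pos = (1 - p_pos) * cost_fp  # flag a true negative
    return 1 if expected_cost_if_neg > expected_cost_if_pos else 0

# With a false negative 10x as costly as a false positive, even a 30%
# positive probability triggers a positive prediction.
print(cost_sensitive_predict(0.3, cost_fp=1.0, cost_fn=10.0))  # prints 1
```

Equivalently, the positive class is predicted whenever p_pos exceeds the threshold cost_fp / (cost_fp + cost_fn); with equal costs this reduces to the usual accuracy-based 0.5 threshold.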
Citations: 2
Data-driven study of urban infrastructure to enable city-wide ubiquitous computing
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501231
Gautam Thakur, Pan Hui, A. Helmy
Engineering a city-wide ubiquitous computing system requires a comprehensive understanding of urban infrastructure including physical motorways, vehicular traffic, and human activities. Many world cities were built at different time periods and with different purposes that resulted in diversified structures and characteristics, which have to be carefully considered while designing ubiquitous computing facilities. In this paper, we propose a novel technique to study global urban infrastructure, with enabling city-wide ubiquitous computing as the aim, using a massive data-driven network of planet-scale online web-cameras and a location-based online social network service, Foursquare. Our approach examines six metropolitan regions' infrastructure that includes more than 800 locations, 25 million vehicular mobility records, 220k routes, and two million Foursquare check-ins. We evaluate the spatio-temporal correlation in traffic patterns, examine the structure and connectivity in regions, and study the impact of human mobility on vehicular traffic to gain insight for enabling city-wide ubiquitous computing.
Citations: 0
TV predictor: personalized program recommendations to be displayed on SmartTVs
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501230
Christopher Krauss, L. George, S. Arbanowski
Switching through the variety of available TV channels to find the most acceptable program at the current time can be very time-consuming. Especially at prime time, when there are lots of different channels offering quality content, it is hard to find the best fitting channel. This paper introduces the TV Predictor, a new application that allows for obtaining personalized program recommendations without leaving the lean-back position in front of the TV. Technically, the usage of common standards and specifications, such as HbbTV, OIPF and W3C, leverages the convergence of broadband and broadcast media. Hints and details can overlay the broadcasting signal, so the user gets predictions in appropriate situations, for instance the most suitable movies playing tonight. Additionally, the TV Predictor Autopilot enables the TV set to automatically change the currently viewed channel. A Second Screen Application mirrors the TV screen or displays additional content on tablet PCs and smartphones. Based on the customer's viewing behavior and explicitly given ratings, the server-side application predicts what the viewer is going to favor. Different data mining approaches are combined in order to calculate the user's preferences: Content-Based Filtering algorithms for similar items, Collaborative Filtering algorithms for rating predictions, Clustering for increasing performance, Association Rules for analyzing item relations, and Support Vector Machines for the identification of behavior patterns. A ten-fold cross validation shows an accuracy in prediction of about 80%. TV-specialized user interfaces, user-generated feedback data and calculated algorithm results, such as Association Rules, are analyzed to underline the characteristics of such a TV-based application.
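As a toy illustration of one of the listed ingredients, item-based Collaborative Filtering for rating prediction, here is a hedged sketch (all names, ratings and similarity values are made up, and this is not the TV Predictor's implementation): the predicted rating for a target program is a similarity-weighted average of the user's ratings for other programs.

```python
def predict_rating(user_ratings, item_sims, target):
    """Item-based collaborative filtering: predict a user's rating for
    `target` as a similarity-weighted average of their other ratings."""
    num = den = 0.0
    for item, rating in user_ratings.items():
        # Similarities are stored symmetrically under either key order.
        sim = item_sims.get((target, item)) or item_sims.get((item, target), 0.0)
        num += sim * rating
        den += abs(sim)
    return num / den if den else None

# show_a is very similar to show_b (rated 5) and barely similar to
# show_c (rated 2), so the prediction lands near 5.
sims = {("show_a", "show_b"): 0.9, ("show_a", "show_c"): 0.2}
print(predict_rating({"show_b": 5.0, "show_c": 2.0}, sims, "show_a"))
```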
Citations: 26
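The abstract above combines several recommenders; as a minimal, self-contained illustration of the Collaborative Filtering rating prediction it mentions (this is a toy sketch, not the paper's implementation — all user names, program names and ratings below are invented for illustration), a similarity-weighted neighbourhood predictor might look like:

```python
import math

# Toy user -> {program: rating} data (hypothetical, for illustration only).
# "alice" has not rated "soap" yet; we predict her rating from her neighbours.
ratings = {
    "alice": {"news": 5, "crime": 3},
    "bob":   {"news": 4, "crime": 3, "soap": 2},
    "carol": {"news": 1, "crime": 5, "soap": 4},
}

def cosine(u, v):
    """Cosine similarity over the programs both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[p] * v[p] for p in common)
    den = (math.sqrt(sum(u[p] ** 2 for p in common))
           * math.sqrt(sum(v[p] ** 2 for p in common)))
    return num / den if den else 0.0

def predict(user, program):
    """Similarity-weighted average of other users' ratings for `program`."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or program not in r:
            continue
        s = cosine(ratings[user], ratings[other])
        num += s * r[program]
        den += abs(s)
    return num / den if den else None

print(f"predicted rating for alice/soap: {predict('alice', 'soap'):.2f}")
```

Since alice's tastes are closer to bob's than to carol's, the prediction is pulled toward bob's low rating of "soap"; in a real system this would be combined with the content-based and clustering components the abstract lists.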
Pushing constraints into data streams
Pub Date : 2013-08-11 DOI: 10.1145/2501221.2501232
Andreia Silva, C. Antunes
One important challenge in data mining is the ability to deal with complex, voluminous and dynamic data. Indeed, due to the great advances in technology, in many real world applications data appear in the form of continuous data streams, as opposed to traditional static datasets. Several techniques have been proposed to explore data streams, in particular for the discovery of frequent co-occurrences in data. However, one of the common criticisms pointed out to frequent pattern mining is the fact that it generates a huge number of patterns, independent of user expertise, making it very hard to analyze and use the results. These bottlenecks are even more evident when dealing with data streams, since new data are continuously and endlessly arriving, and many intermediate results must be kept in memory. The use of constraints to filter the results is the most common and used approach to focus the discovery on what is really interesting. In this sense, there is a need for the integration of data stream mining with constrained mining. In this work we describe a set of strategies for pushing constraints into data stream mining, through the use of a pattern tree structure that captures a summary of the current possible patterns. We also propose an algorithm that discovers patterns in data streams that satisfy any user defined constraint.
Citations: 6
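To make the idea of "pushing" a constraint into the mining loop concrete, here is a minimal Apriori-style sketch. The paper itself works over streams with a pattern-tree summary; this toy batch version with a hypothetical price constraint only illustrates the core trick — an anti-monotone constraint is checked during candidate generation, so violating itemsets are pruned before they are ever counted:

```python
from itertools import combinations

# Toy transactions and item prices (invented for illustration).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
prices = {"a": 3, "b": 1, "c": 2}

def frequent_constrained(txns, min_sup, max_price):
    """Apriori that pushes the anti-monotone constraint
    sum(prices) <= max_price into candidate generation."""
    def ok(itemset):
        return sum(prices[i] for i in itemset) <= max_price

    items = {i for t in txns for i in t}
    level = [frozenset([i]) for i in sorted(items) if ok([i])]
    result = {}
    while level:
        # count support only for candidates that survived the constraint
        counts = {c: sum(c <= t for t in txns) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(frequent)
        # join step: extend surviving itemsets, pruning by the constraint
        # early (no standard subset-pruning here, to keep the sketch short)
        keys = list(frequent)
        level = list({a | b for a, b in combinations(keys, 2)
                      if len(a | b) == len(a) + 1 and ok(a | b)})
    return result

print(frequent_constrained(transactions, min_sup=2, max_price=4))
```

With `max_price=4`, the candidate {"a", "c"} (price 5) is discarded before counting — exactly the kind of early pruning that matters in a stream, where every avoided candidate is one fewer intermediate result to keep in memory.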