
2010 IEEE International Conference on Data Mining Workshops: Latest Papers

EigenDiagnostics: Spotting Connection Patterns and Outliers in Large Graphs
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.203
Koji Maruhashi, C. Faloutsos
In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. Detecting such patterns would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter-plots of node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics to a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port-scanning (in computer networks), repetitive tasks with bot-net-like behavior, strange "bridges" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns; (b) it is fast (linear in the number of edges); (c) it is parameter-free; and (d) it is general, and applicable to many diverse graphs, spanning tens of gigabytes.
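The abstract names the Hough transform as the way linear patterns are found in scatter-plots, without giving details. As a rough illustration only (not the authors' implementation), the sketch below lets each point vote for every line `rho = x*cos(theta) + y*sin(theta)` passing through it and reports the most-voted line. The function name `strongest_line`, the one-degree angle step, and the integer rounding of `rho` are assumptions made for this example.

```python
import math
from collections import Counter

def strongest_line(points, theta_step_deg=1):
    """Return (theta_deg, rho, votes) for the most-voted line.

    Lines are parameterized as rho = x*cos(theta) + y*sin(theta).
    The bin sizes (1 degree, integer rho) are illustrative choices.
    """
    votes = Counter()
    for x, y in points:
        # Each point votes for every discretized line that passes through it.
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    (theta_deg, rho), count = votes.most_common(1)[0]
    return theta_deg, rho, count
```

For example, calling `strongest_line([(x, 2) for x in (0, 10, 20, 30, 40)])` on five points of the horizontal line y = 2 puts all five votes in the bin with `theta_deg = 90` and `rho = 2`.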
Citations: 5
Cluster Cores and Modularity Maximization
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.63
Michael Ovelgönne, A. Geyer-Schulz
The modularity function is a widely used measure of the quality of a graph clustering. Finding a clustering with maximal modularity is NP-hard. Thus, only heuristic algorithms are capable of processing large datasets. Extensive literature on such heuristics has been published in recent years. We present a fast randomized greedy algorithm which uses solely local information on gradients of the objective function. Furthermore, we present an approach which first identifies the 'cores' of clusters before calculating the final clustering. The global heuristic of identifying core groups solves problems associated with purely local approaches. With the presented algorithms we were able to calculate, for many real-world datasets, a clustering with a higher modularity than any previous algorithm.
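For readers unfamiliar with the objective being maximized, here is a minimal sketch of the standard Newman-Girvan modularity Q for an undirected, unweighted graph without self-loops. It only evaluates a given clustering; it is not the randomized greedy or core-groups algorithm from the paper. The names `modularity`, `edges`, and `community` are chosen for this example.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping each node to its cluster label.
    """
    m = len(edges)
    intra = defaultdict(int)       # edges with both ends in the cluster
    degree_sum = defaultdict(int)  # total degree of the cluster's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    # Q = sum over clusters of (fraction of intra edges - expected fraction).
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)
```

On two triangles joined by a single edge, splitting the graph into the two triangles gives Q = 5/14, while putting all nodes in one cluster gives Q = 0.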
Citations: 22
Minimum Spanning Tree Based Classification Model for Massive Data with MapReduce Implementation
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.14
Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, Jianping Fan
Rapid growth of data has provided us with more information, yet it challenges traditional techniques for extracting useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor method and the cluster-based classification method, aiming to overcome their disadvantages and cope with large amounts of data. Our model is implemented on the Hadoop platform, using its MapReduce programming framework, which is particularly suitable for cloud computing. We have run experiments on several data sets, including real-world data from the UCI repository and synthetic data, using Downing 4000 clusters installed with Hadoop. The results show that our model generally outperforms KNN and some other classification methods with respect to accuracy and scalability.
Citations: 9
Large-Scale Matrix Factorization Using MapReduce
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.155
Zhengguo Sun, Tao Li, N. Rishe
Due to the popularity of nonnegative matrix factorization and the increasing availability of massive data sets, researchers are facing the problem of factorizing large-scale matrices with dimensions on the order of millions. Recent research [11] has shown that it is feasible to factorize a million-by-million matrix with billions of nonzero elements on a MapReduce cluster. In this work, we present three different matrix multiplication implementations and scale up three types of nonnegative matrix factorizations on MapReduce. Experiments on both synthetic and real-world datasets show the excellent scalability of our proposed algorithms.
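The abstract does not describe the three multiplication implementations, so the following is only a generic single-process sketch of how a sparse matrix product can be phrased as a map step and a reduce step. It is not one of the paper's schemes and uses no Hadoop API. The dictionary-of-coordinates format and the names `map_phase`, `reduce_phase`, and `sparse_matmul` are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(A, B):
    """Emit ((i, j), a_ik * b_kj) for each pair of nonzeros sharing index k.

    A and B are sparse matrices stored as {(row, col): value}.
    """
    b_rows = defaultdict(list)
    for (k, j), b in B.items():
        b_rows[k].append((j, b))
    for (i, k), a in A.items():
        for j, b in b_rows[k]:
            yield (i, j), a * b

def reduce_phase(pairs):
    """Sum all partial products that share the same output cell (i, j)."""
    out = defaultdict(float)
    for key, value in pairs:
        out[key] += value
    return dict(out)

def sparse_matmul(A, B):
    """Compute C = A @ B by chaining the two phases in one process."""
    return reduce_phase(map_phase(A, B))
```

In a real MapReduce job the framework would group the emitted pairs by key between the two phases; here the grouping happens inside `reduce_phase`.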
Citations: 20
Using SOM-Ward Clustering and Predictive Analytics for Conducting Customer Segmentation
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.121
Zhiyuan Yao, T. Eklund, B. Back
Continuously increasing amounts of data in data warehouses are providing companies with ample opportunity to conduct analytical customer relationship management (CRM). However, how to utilize the information retrieved from the analysis of these data to retain the most valuable customers, identify customers with additional revenue potential, and achieve cost-effective customer relationship management continues to pose challenges for companies. This study proposes a two-level approach combining SOM-Ward clustering and predictive analytics to segment the customer base of a case company with 1.5 million customers. First, according to the spending amount and the demographic and behavioral characteristics of the customers, we adopt SOM-Ward clustering to segment the customer base into seven segments: exclusive customers, high-spending customers, and five segments of mass customers. Then, three classification models (the support vector machine (SVM), the neural network, and the decision tree) are employed to classify high-spending and low-spending customers. The performance of the three classification models is evaluated and compared. The three models are then combined to predict potential high-spending customers among the mass customers. It is found that this hybrid approach could provide more thorough and detailed information about the customer base, especially the untapped mass market with potentially high revenue contribution, for tailoring actionable marketing strategies.
Citations: 18
Improving Matching Process in Social Network
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.41
Lin Chen, R. Nayak, Yue Xu
Online dating networks, a type of social network, are gaining popularity. With many people joining and being available in the network, users are overwhelmed with choices when choosing their ideal partners. This problem can be overcome by utilizing recommendation methods. However, traditional recommendation methods are ineffective and inefficient for online dating networks, where the dataset is sparse and/or large and two-way matching is required. We propose a methodology that uses clustering and SimRank to recommend matching candidates to users in an online dating network. Data from a live online dating network is used in the evaluation. The success rate of recommendations obtained using the proposed method is compared with the baseline success rate of the network, and the proposed method doubles it.
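SimRank is named in the abstract but not explained. As a hedged sketch, the code below runs the basic SimRank iteration, in which two nodes are similar when their in-neighbors are similar. It leaves out the clustering step and the two-way matching the paper describes. The decay factor 0.8, the iteration count, and the requirement that every in-neighbor also appears as a key of `in_neighbors` are assumptions for this example.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic SimRank on a directed graph.

    in_neighbors: dict mapping every node to the list of nodes pointing to it
    (each listed node must itself be a key of the dict).
    Returns a nested dict sim[a][b] of similarity scores in [0, 1].
    """
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_neighbors[a] or not in_neighbors[b]:
                    new[a][b] = 0.0  # no in-neighbors: similarity is zero
                else:
                    # Average similarity over all pairs of in-neighbors.
                    total = sum(sim[i][j]
                                for i in in_neighbors[a]
                                for j in in_neighbors[b])
                    new[a][b] = decay * total / (
                        len(in_neighbors[a]) * len(in_neighbors[b]))
        sim = new
    return sim
```

With a single node `u` pointing to both `a` and `b`, the score of the pair `(a, b)` converges to the decay factor, 0.8.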
Citations: 15
Combining Time Series Similarity with Density-Based Clustering to Identify Fiber Bundles in the Human Brain
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.15
Junming Shao, K. Hahn, Qinli Yang, C. Böhm, A. Wohlschläger, Nicholas Myers, C. Plant
Understanding the connectome of the human brain is a major challenge in neuroscience. Discovering the wiring and the major cables of the brain is essential for a better understanding of brain function. Diffusion Tensor Imaging (DTI) provides a potential way of exploring the organization of white matter fiber tracts in human subjects non-invasively. However, it is a long way from the approximately one million voxels of a raw DT image to utilizable knowledge. After preprocessing, including registration and motion correction, fiber tracking approaches extract thousands of fibers from diffusion weighted images. In this paper, we focus on the question of how to identify meaningful groups of fiber tracks which represent the major cables of the brain. We combine ideas from time series mining with density-based clustering into a novel framework for effective and efficient fiber clustering. We first introduce a novel fiber similarity measure based on dynamic time warping. This fiber warping measure successfully captures local similarity among fibers belonging to a common bundle but having different start and end points. A lower bound on this fiber warping measure speeds up computation. The result of fiber tracking often contains imperfect fibers and outliers. Therefore, we combine fiber warping with an outlier-robust density-based clustering algorithm. Extensive experiments on synthetic data and real data demonstrate the effectiveness and efficiency of our approach.
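The fiber warping measure itself is not defined in the abstract. The sketch below shows only the classic dynamic time warping (DTW) recurrence it is said to build on, applied to two fibers given as sequences of 3D points. The function name `dtw_distance` and the use of Euclidean point distance are assumptions, and the paper's handling of differing start and end points and its lower bound are not reproduced.

```python
import math

def dtw_distance(fiber_a, fiber_b):
    """Classic DTW cost between two sequences of 3D points.

    cost[i][j] holds the cheapest alignment of the first i points of
    fiber_a with the first j points of fiber_b.
    """
    n, m = len(fiber_a), len(fiber_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(fiber_a[i - 1], fiber_b[j - 1])
            # Extend the cheapest of: step in a, step in b, step in both.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

Because warping can repeat points, a fiber and a resampled copy of it that duplicates one point have distance 0, whereas a copy shifted by one unit accumulates one unit of cost per aligned point.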
Citations: 13
Enhancing Ubiquitous Systems through System Call Mining
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.133
K. Morik, F. Jungermann, N. Piatkowski, M. Engel
Collecting, monitoring, and analyzing data automatically by well-instrumented systems is frequently motivated by human decision-making. However, the same need occurs when system software decisions are to be justified. Compiler optimization or storage management requires several decisions which result in more or less resource consumption, be it energy, memory, or runtime. A large amount of system data can be collected in order to base decisions of compilers or the operating system on empirical analysis. The challenge of large-scale data is aggravated if system data of small and often mobile systems are collected and analyzed. In contrast to the large data volume, the mobile devices offer only very limited storage and computing capacity. Moreover, if analysis results are put to use in the operating system, the real-time response is at the system level, not on the level of human reaction time. In this paper, small and most often mobile systems (i.e., ubiquitous systems) are instrumented for the collection of system call data. We investigate whether or not the learning method should take the sequence and the structure of system calls into account. A structural learning method, Conditional Random Fields (CRF), is applied using different internal optimization algorithms and feature mappings. Implementing CRF in a massively parallel way using general-purpose graphics processing units (GPGPU) points toward future ubiquitous systems.
Citations: 3
Bridging Folksonomies and Domain Ontologies: Getting Out Non-taxonomic Relations
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.72
C. Trabelsi, A. Jrad, S. Yahia
Social bookmarking tools are rapidly emerging on the Web, as witnessed by the overwhelming number of participants. In such spaces, users annotate resources by means of any keyword or tag that they find relevant, giving rise to lightweight conceptual structures, aka folksonomies. In this respect, needless to mention that ontologies can be of benefit for enhancing information retrieval metrics. In this paper, we introduce a novel approach for ontology learning from a folksonomy, which provides shared vocabularies and semantic relations between tags. The main thrust of the introduced approach is its focus on the discovery of non-taxonomic relationships. The latter are often neglected, even though they are of paramount importance from a semantic point of view. The discovery process relies heavily on triadic concepts to discover and select related tags and to extract and label non-taxonomic relationships between related tags, and on external sources for tag filtering and non-taxonomic relationship extraction. In addition, we discuss a new approach to automatically evaluate the obtained relations against the WordNet repository, and present promising results for a real-world folksonomy.
Citations: 32
Using Self-Organizing Map and Heuristics to Identify Small Statistical Areas Based on Household Socio-Economic Indicators in Turkey's Address Based Population Register System
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.104
H. Düzgün, Seyma Ozcan Yavuzoglu
Census operations are major events in a nation's history, covering all land and property of the country and its citizens. Publishing census data by spatial unit is one of the important problems facing national statistical organizations, which requires determining small statistical areas (SSAs), the so-called census geography. Since 2006, Turkey has aimed to produce census data not as “de-facto” (static) but as “de-jure” (real-time) through the new Address Based Register Information System (ABPRS). With this register-based census, personal information is matched with address information, so censuses gain a spatial dimension. However, because Turkey lacks SSAs, the data cannot be published at finer spatial granularity. This study employs a spatial clustering and districting methodology to automatically produce SSAs, built on ABPRS data that is geo-referenced with the aid of geographical information systems (GIS). To this end, simulated annealing on k-means clustering of Self-Organizing Map (SOM) unified distances is used to produce SSAs for ABPRS. The method is applied to block datasets containing either raw census data or socio-economic status (SES) indices derived from census data. The resulting SSAs are evaluated for the case study area.
Citations: 0