In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. Detecting such patterns would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter plots of node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics to a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port-scanning (in computer networks), repetitive tasks with botnet-like behavior, strange "bridges" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns; (b) it is fast (linear in the number of edges); (c) it is parameter-free; and (d) it is general, applicable to many diverse graphs spanning tens of gigabytes.
{"title":"EigenDiagnostics: Spotting Connection Patterns and Outliers in Large Graphs","authors":"Koji Maruhashi, C. Faloutsos","doi":"10.1109/ICDMW.2010.203","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.203","url":null,"abstract":"In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. Detecting such patterns would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter plots of node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics to a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port-scanning (in computer networks), repetitive tasks with botnet-like behavior, strange \"bridges\" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns; (b) it is fast (linear in the number of edges); (c) it is parameter-free; and (d) it is general, applicable to many diverse graphs spanning tens of gigabytes.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128620341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
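The key step of the abstract above, spotting linear-like patterns in node-property scatter plots via the Hough transform, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the accumulator resolution (`n_theta`, `rho_step`) and the toy point set are our own assumptions.

```python
import math
from collections import Counter

def hough_line_votes(points, n_theta=180, rho_step=1.0):
    """Each point (x, y) votes for every line rho = x*cos(theta) + y*sin(theta)
    passing through it; collinear points pile their votes into one (theta, rho)
    bin of the accumulator, so a strong bin flags a linear-like pattern."""
    acc = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[(t, round(rho / rho_step))] += 1
    return acc

# 20 points on the line y = x plus one off-line outlier: the winning bin
# collects exactly the 20 collinear votes.
pts = [(i, i) for i in range(20)] + [(3, 15)]
best_bin, votes = max(hough_line_votes(pts).items(), key=lambda kv: kv[1])
print(votes)  # → 20
```

In the paper's setting the (x, y) pairs would be node properties such as degree versus eigenscore, and the peak bins identify the suspicious node groups.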
The modularity function is a widely used measure of the quality of a graph clustering. Finding a clustering with maximal modularity is NP-hard, so only heuristic algorithms are capable of processing large datasets. Extensive literature on such heuristics has been published in recent years. We present a fast randomized greedy algorithm that uses solely local information on gradients of the objective function. Furthermore, we present an approach that first identifies the 'cores' of clusters before calculating the final clustering. The global heuristic of identifying core groups solves problems associated with purely local approaches. With the presented algorithms, we were able to calculate, for many real-world datasets, a clustering with higher modularity than any previous algorithm.
{"title":"Cluster Cores and Modularity Maximization","authors":"Michael Ovelgönne, A. Geyer-Schulz","doi":"10.1109/ICDMW.2010.63","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.63","url":null,"abstract":"The modularity function is a widely used measure of the quality of a graph clustering. Finding a clustering with maximal modularity is NP-hard, so only heuristic algorithms are capable of processing large datasets. Extensive literature on such heuristics has been published in recent years. We present a fast randomized greedy algorithm that uses solely local information on gradients of the objective function. Furthermore, we present an approach that first identifies the 'cores' of clusters before calculating the final clustering. The global heuristic of identifying core groups solves problems associated with purely local approaches. With the presented algorithms, we were able to calculate, for many real-world datasets, a clustering with higher modularity than any previous algorithm.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128134219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
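For readers unfamiliar with the objective being maximized above, here is a minimal sketch of Newman-Girvan modularity for an unweighted, undirected graph. The paper's randomized greedy and core-group algorithms optimize this quantity; the code below only evaluates it, and the two-triangle toy graph is our own illustration.

```python
def modularity(edges, community):
    """Modularity Q = sum over communities c of (e_c / m - (d_c / 2m)^2),
    where m is the edge count, e_c the number of intra-community edges,
    and d_c the total degree of community c. `community` maps node -> label."""
    m = len(edges)
    deg, intra = {}, {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    tot = {}  # total degree per community
    for node, d in deg.items():
        tot[community[node]] = tot.get(community[node], 0) + d
    return sum(intra.get(c, 0) / m - (tot[c] / (2 * m)) ** 2 for c in tot)

# Two triangles joined by one edge: the natural two-way split scores
# Q ≈ 0.357, while lumping everything into one cluster scores 0.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
lump = {n: 'x' for n in range(6)}
print(round(modularity(edges, split), 3), modularity(edges, lump))  # → 0.357 0.0
```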
Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, Jianping Fan
The rapid growth of data has provided us with more information, yet challenges traditional techniques to extract useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor (KNN) method and cluster-based classification, aiming to overcome their disadvantages and cope with large amounts of data. Our model is implemented on the Hadoop platform, using its MapReduce programming framework, which is particularly suitable for cloud computing. We have run experiments on several datasets, including real-world data from the UCI repository and synthetic data, on a Downing 4000 cluster with Hadoop installed. The results show that our model generally outperforms KNN and some other classification methods with respect to accuracy and scalability.
{"title":"Minimum Spanning Tree Based Classification Model for Massive Data with MapReduce Implementation","authors":"Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, Jianping Fan","doi":"10.1109/ICDMW.2010.14","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.14","url":null,"abstract":"The rapid growth of data has provided us with more information, yet challenges traditional techniques to extract useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor (KNN) method and cluster-based classification, aiming to overcome their disadvantages and cope with large amounts of data. Our model is implemented on the Hadoop platform, using its MapReduce programming framework, which is particularly suitable for cloud computing. We have run experiments on several datasets, including real-world data from the UCI repository and synthetic data, on a Downing 4000 cluster with Hadoop installed. The results show that our model generally outperforms KNN and some other classification methods with respect to accuracy and scalability.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128723822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
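The MST ingredient of MCMM can be illustrated on a single machine with Prim's algorithm; the cut-the-heaviest-edge step and the toy points below are our own illustrative assumptions (the paper distributes this computation with MapReduce rather than running it serially).

```python
def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph over `points`;
    returns a list of MST edges as (u, v, weight) index triples."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # cheapest edge leaving the current tree
        u, v = min(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
        in_tree.add(v)
        edges.append((u, v, dist(points[u], points[v])))
    return edges

# Two well-separated groups: the heaviest MST edge is the single "bridge"
# between them, so cutting it recovers the two clusters.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mst = prim_mst(pts)
bridge = max(mst, key=lambda e: e[2])
print(len(mst), (bridge[0] < 3) != (bridge[1] < 3))  # → 5 True
```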
Due to the popularity of nonnegative matrix factorization and the increasing availability of massive data sets, researchers face the problem of factorizing large-scale matrices with dimensions on the order of millions. Recent research [11] has shown that it is feasible to factorize a million-by-million matrix with billions of nonzero elements on a MapReduce cluster. In this work, we present three different matrix multiplication implementations and scale up three types of nonnegative matrix factorizations on MapReduce. Experiments on both synthetic and real-world datasets show the excellent scalability of our proposed algorithms.
{"title":"Large-Scale Matrix Factorization Using MapReduce","authors":"Zhengguo Sun, Tao Li, N. Rishe","doi":"10.1109/ICDMW.2010.155","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.155","url":null,"abstract":"Due to the popularity of nonnegative matrix factorization and the increasing availability of massive data sets, researchers face the problem of factorizing large-scale matrices with dimensions on the order of millions. Recent research [11] has shown that it is feasible to factorize a million-by-million matrix with billions of nonzero elements on a MapReduce cluster. In this work, we present three different matrix multiplication implementations and scale up three types of nonnegative matrix factorizations on MapReduce. Experiments on both synthetic and real-world datasets show the excellent scalability of our proposed algorithms.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125026132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
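As a single-machine reference point for the factorizations being scaled, here is a sketch of the standard Lee-Seung multiplicative updates for the Frobenius objective. The updates consist only of matrix products, which is what makes them amenable to MapReduce-style scaling; the toy matrix, iteration count, and initialization are our own illustrative choices, not the paper's.

```python
import numpy as np

def nmf(V, r, iters=500, seed=0):
    """Multiplicative updates for V ≈ W @ H with W, H >= 0 (Frobenius loss).
    Every step is a chain of matrix products, each of which can be
    expressed as a MapReduce job for large V."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 0.1
    H = rng.random((r, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

V = np.array([[1., 0., 2.], [2., 0., 4.], [0., 3., 0.]])  # nonnegative, rank 2
W, H = nmf(V, r=2)
print(float(np.abs(V - W @ H).max()) < 0.05)  # reconstruction is near-exact
```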
Continuously increasing amounts of data in data warehouses are providing companies with ample opportunity to conduct analytical customer relationship management (CRM). However, how to utilize the information retrieved from the analysis of these data to retain the most valuable customers, identify customers with additional revenue potential, and achieve cost-effective customer relationship management continues to pose challenges for companies. This study proposes a two-level approach combining SOM-Ward clustering and predictive analytics to segment the customer base of a case company with 1.5 million customers. First, according to the customers' spending amounts and demographic and behavioral characteristics, we adopt SOM-Ward clustering to segment the customer base into seven segments: exclusive customers, high-spending customers, and five segments of mass customers. Then, three classification models - the support vector machine (SVM), the neural network, and the decision tree - are employed to classify high-spending and low-spending customers. The performance of the three classification models is evaluated and compared, and the three models are then combined to predict potential high-spending customers among the mass customers. We find that this hybrid approach can provide more thorough and detailed information about the customer base, especially the untapped mass market with potentially high revenue contribution, for tailoring actionable marketing strategies.
{"title":"Using SOM-Ward Clustering and Predictive Analytics for Conducting Customer Segmentation","authors":"Zhiyuan Yao, T. Eklund, B. Back","doi":"10.1109/ICDMW.2010.121","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.121","url":null,"abstract":"Continuously increasing amounts of data in data warehouses are providing companies with ample opportunity to conduct analytical customer relationship management (CRM). However, how to utilize the information retrieved from the analysis of these data to retain the most valuable customers, identify customers with additional revenue potential, and achieve cost-effective customer relationship management continues to pose challenges for companies. This study proposes a two-level approach combining SOM-Ward clustering and predictive analytics to segment the customer base of a case company with 1.5 million customers. First, according to the customers' spending amounts and demographic and behavioral characteristics, we adopt SOM-Ward clustering to segment the customer base into seven segments: exclusive customers, high-spending customers, and five segments of mass customers. Then, three classification models - the support vector machine (SVM), the neural network, and the decision tree - are employed to classify high-spending and low-spending customers. The performance of the three classification models is evaluated and compared, and the three models are then combined to predict potential high-spending customers among the mass customers. We find that this hybrid approach can provide more thorough and detailed information about the customer base, especially the untapped mass market with potentially high revenue contribution, for tailoring actionable marketing strategies.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131402435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online dating networks, a type of social network, are gaining popularity. With many people joining and being available in the network, users are overwhelmed with choices when choosing their ideal partners. This problem can be overcome by utilizing recommendation methods. However, traditional recommendation methods are ineffective and inefficient for online dating networks, where the dataset is sparse and/or large and two-way matching is required. We propose a methodology that uses clustering and SimRank to recommend matching candidates to users in an online dating network. Data from a live online dating network is used in the evaluation. The success rate of recommendations obtained using the proposed method is compared with the baseline success rate of the network, and the proposed method doubles the success rate.
{"title":"Improving Matching Process in Social Network","authors":"Lin Chen, R. Nayak, Yue Xu","doi":"10.1109/ICDMW.2010.41","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.41","url":null,"abstract":"Online dating networks, a type of social network, are gaining popularity. With many people joining and being available in the network, users are overwhelmed with choices when choosing their ideal partners. This problem can be overcome by utilizing recommendation methods. However, traditional recommendation methods are ineffective and inefficient for online dating networks, where the dataset is sparse and/or large and two-way matching is required. We propose a methodology that uses clustering and SimRank to recommend matching candidates to users in an online dating network. Data from a live online dating network is used in the evaluation. The success rate of recommendations obtained using the proposed method is compared with the baseline success rate of the network, and the proposed method doubles the success rate.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121362516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
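The SimRank component of the methodology above can be sketched with the naive iterative definition. The bipartite toy graph (users connected to the profiles they contacted) is our own illustrative assumption; the paper combines SimRank with clustering on real dating-network data rather than running this naive O(n^2) version.

```python
def simrank(in_nbrs, C=0.8, iters=10):
    """Naive SimRank: s(a, a) = 1 and
    s(a, b) = C / (|I(a)| * |I(b)|) * sum over in-neighbor pairs of s(i, j),
    with s(a, b) = 0 when either in-neighborhood is empty.
    `in_nbrs` maps node -> list of in-neighbors."""
    nodes = list(in_nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    s = sum(sim[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                    new[(a, b)] = C * s / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# Profiles p and q share the in-neighbor u1, so they come out similar.
graph = {'u1': [], 'u2': [], 'p': ['u1', 'u2'], 'q': ['u1']}
sim = simrank(graph)
print(round(sim[('p', 'q')], 3))  # → 0.4
```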
Junming Shao, K. Hahn, Qinli Yang, C. Böhm, A. Wohlschläger, Nicholas Myers, C. Plant
Understanding the connectome of the human brain is a major challenge in neuroscience. Discovering the wiring and the major cables of the brain is essential for a better understanding of brain function. Diffusion Tensor Imaging (DTI) provides a non-invasive way of exploring the organization of white matter fiber tracts in human subjects. However, it is a long way from the approximately one million voxels of a raw DT image to utilizable knowledge. After preprocessing, including registration and motion correction, fiber tracking approaches extract thousands of fibers from diffusion-weighted images. In this paper, we focus on the question of how to identify meaningful groups of fiber tracts that represent the major cables of the brain. We combine ideas from time series mining with density-based clustering into a novel framework for effective and efficient fiber clustering. We first introduce a novel fiber similarity measure based on dynamic time warping. This fiber warping measure successfully captures local similarity among fibers that belong to a common bundle but have different start and end points, and a lower bound on the measure speeds up computation. Because the result of fiber tracking often contains imperfect fibers and outliers, we combine fiber warping with an outlier-robust density-based clustering algorithm. Extensive experiments on synthetic and real data demonstrate the effectiveness and efficiency of our approach.
{"title":"Combining Time Series Similarity with Density-Based Clustering to Identify Fiber Bundles in the Human Brain","authors":"Junming Shao, K. Hahn, Qinli Yang, C. Böhm, A. Wohlschläger, Nicholas Myers, C. Plant","doi":"10.1109/ICDMW.2010.15","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.15","url":null,"abstract":"Understanding the connectome of the human brain is a major challenge in neuroscience. Discovering the wiring and the major cables of the brain is essential for a better understanding of brain function. Diffusion Tensor Imaging (DTI) provides a non-invasive way of exploring the organization of white matter fiber tracts in human subjects. However, it is a long way from the approximately one million voxels of a raw DT image to utilizable knowledge. After preprocessing, including registration and motion correction, fiber tracking approaches extract thousands of fibers from diffusion-weighted images. In this paper, we focus on the question of how to identify meaningful groups of fiber tracts that represent the major cables of the brain. We combine ideas from time series mining with density-based clustering into a novel framework for effective and efficient fiber clustering. We first introduce a novel fiber similarity measure based on dynamic time warping. This fiber warping measure successfully captures local similarity among fibers that belong to a common bundle but have different start and end points, and a lower bound on the measure speeds up computation. Because the result of fiber tracking often contains imperfect fibers and outliers, we combine fiber warping with an outlier-robust density-based clustering algorithm. Extensive experiments on synthetic and real data demonstrate the effectiveness and efficiency of our approach.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116911104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
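The dynamic time warping backbone of the fiber-warping measure is the classic textbook recurrence. A minimal scalar-sequence version looks like this; in the paper the sequences are 3D fiber polylines and a lower bound prunes the computation, both of which are omitted here.

```python
def dtw(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping with |a - b| step cost:
    D[i][j] = cost(i, j) + min(D[i-1][j], D[i][j-1], D[i-1][j-1])."""
    INF = float('inf')
    n, m = len(x), len(y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(D[i - 1][j], D[i][j - 1],
                                                     D[i - 1][j - 1])
    return D[n][m]

a = [0, 1, 2, 3, 2, 1]
b = [0, 0, 1, 2, 3, 2, 1]   # same shape, shifted in time
c = [5, 5, 5, 5, 5, 5]      # genuinely different sequence
print(dtw(a, b), dtw(a, c))  # → 0.0 21.0
```

Warping absorbs the time shift between `a` and `b` entirely, which is exactly the property that lets the fiber measure match bundle-mates with different start and end points.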
Collecting, monitoring, and analyzing data automatically with well-instrumented systems is frequently motivated by human decision-making. However, the same need occurs when system software decisions are to be justified. Compiler optimization and storage management require several decisions that result in more or less resource consumption, be it energy, memory, or runtime. A multitude of system data can be collected in order to base decisions of compilers or the operating system on empirical analysis. The challenge of large-scale data is aggravated when system data from small and often mobile systems are collected and analyzed. In contrast to the large data volume, mobile devices offer only very limited storage and computing capacity. Moreover, if analysis results are put to use in the operating system, the real-time response must be at the system level, not at the level of human reaction time. In this paper, small and most often mobile systems (i.e., ubiquitous systems) are instrumented for the collection of system call data. We investigate whether the sequence and structure of system calls need to be taken into account by the learning method. A structural learning method, Conditional Random Fields (CRF), is applied using different internal optimization algorithms and feature mappings. Implementing CRF in a massively parallel way using general-purpose graphics processing units (GPGPUs) points toward future ubiquitous systems.
{"title":"Enhancing Ubiquitous Systems through System Call Mining","authors":"K. Morik, F. Jungermann, N. Piatkowski, M. Engel","doi":"10.1109/ICDMW.2010.133","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.133","url":null,"abstract":"Collecting, monitoring, and analyzing data automatically with well-instrumented systems is frequently motivated by human decision-making. However, the same need occurs when system software decisions are to be justified. Compiler optimization and storage management require several decisions that result in more or less resource consumption, be it energy, memory, or runtime. A multitude of system data can be collected in order to base decisions of compilers or the operating system on empirical analysis. The challenge of large-scale data is aggravated when system data from small and often mobile systems are collected and analyzed. In contrast to the large data volume, mobile devices offer only very limited storage and computing capacity. Moreover, if analysis results are put to use in the operating system, the real-time response must be at the system level, not at the level of human reaction time. In this paper, small and most often mobile systems (i.e., ubiquitous systems) are instrumented for the collection of system call data. We investigate whether the sequence and structure of system calls need to be taken into account by the learning method. A structural learning method, Conditional Random Fields (CRF), is applied using different internal optimization algorithms and feature mappings. Implementing CRF in a massively parallel way using general-purpose graphics processing units (GPGPUs) points toward future ubiquitous systems.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115258582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social bookmarking tools are rapidly emerging on the Web, as witnessed by the overwhelming number of participants. In such spaces, users annotate resources with any keywords or tags that they find relevant, giving rise to lightweight conceptual structures known as folksonomies. In this respect, ontologies can be of benefit for enhancing information retrieval. In this paper, we introduce a novel approach for ontology learning from a folksonomy, which provides shared vocabularies and semantic relations between tags. The main thrust of the introduced approach lies in focusing on the discovery of non-taxonomic relationships, which are often neglected even though they are of paramount importance from a semantic point of view. The discovery process relies heavily on triadic concepts to discover and select related tags, and on external sources for tag filtering and for extracting and labeling non-taxonomic relationships between related tags. In addition, we discuss a new approach for automatically evaluating the obtained relations against the WordNet repository, and we present promising results for a real-world folksonomy.
{"title":"Bridging Folksonomies and Domain Ontologies: Getting Out Non-taxonomic Relations","authors":"C. Trabelsi, A. Jrad, S. Yahia","doi":"10.1109/ICDMW.2010.72","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.72","url":null,"abstract":"Social bookmarking tools are rapidly emerging on the Web, as witnessed by the overwhelming number of participants. In such spaces, users annotate resources with any keywords or tags that they find relevant, giving rise to lightweight conceptual structures known as folksonomies. In this respect, ontologies can be of benefit for enhancing information retrieval. In this paper, we introduce a novel approach for ontology learning from a folksonomy, which provides shared vocabularies and semantic relations between tags. The main thrust of the introduced approach lies in focusing on the discovery of non-taxonomic relationships, which are often neglected even though they are of paramount importance from a semantic point of view. The discovery process relies heavily on triadic concepts to discover and select related tags, and on external sources for tag filtering and for extracting and labeling non-taxonomic relationships between related tags. In addition, we discuss a new approach for automatically evaluating the obtained relations against the WordNet repository, and we present promising results for a real-world folksonomy.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114463407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Census operations are very important events in the history of a nation. These operations cover every bit of land and property of the country and its citizens. The publication of census data based on spatial units is one of the important problems of national statistical organizations, and it requires the determination of small statistical areas (SSAs), the so-called census geography. Since 2006, Turkey has aimed to produce census data not on a “de facto” (static) basis but on a “de jure” (real-time) basis through the new Address Based Population Register System (ABPRS). Moreover, in this new register-based census, personal information is matched with address information, so censuses have gained a spatial dimension. However, as Turkey lacks SSAs, the data cannot be published at smaller spatial granularities. This study employs a spatial clustering and districting methodology to automatically produce SSAs built upon the ABPRS data, geo-referenced with the aid of geographical information systems (GIS). For its realization, simulated annealing on k-means clustering of Self-Organizing Map (SOM) unified distances is employed to produce SSAs for the ABPRS. The method is implemented on block datasets containing either raw census data or socio-economic status (SES) indices obtained from census data. The resulting SSAs are evaluated for the case study area.
{"title":"Using Self-Organizing Map and Heuristics to Identify Small Statistical Areas Based on Household Socio-Economic Indicators in Turkey's Address Based Population Register System","authors":"H. Düzgün, Seyma Ozcan Yavuzoglu","doi":"10.1109/ICDMW.2010.104","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.104","url":null,"abstract":"Census operations are very important events in the history of a nation. These operations cover every bit of land and property of the country and its citizens. The publication of census data based on spatial units is one of the important problems of national statistical organizations, and it requires the determination of small statistical areas (SSAs), the so-called census geography. Since 2006, Turkey has aimed to produce census data not on a “de facto” (static) basis but on a “de jure” (real-time) basis through the new Address Based Population Register System (ABPRS). Moreover, in this new register-based census, personal information is matched with address information, so censuses have gained a spatial dimension. However, as Turkey lacks SSAs, the data cannot be published at smaller spatial granularities. This study employs a spatial clustering and districting methodology to automatically produce SSAs built upon the ABPRS data, geo-referenced with the aid of geographical information systems (GIS). For its realization, simulated annealing on k-means clustering of Self-Organizing Map (SOM) unified distances is employed to produce SSAs for the ABPRS. The method is implemented on block datasets containing either raw census data or socio-economic status (SES) indices obtained from census data. The resulting SSAs are evaluated for the case study area.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124739470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
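The clustering core of the SSA pipeline can be sketched with plain Lloyd k-means. The SOM unified-distance input and the simulated-annealing refinement described in the abstract are omitted, and the toy 2D points are an illustrative stand-in for block-level socio-economic indicators.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd k-means on tuples: alternate nearest-center assignment
    and center recomputation for a fixed number of rounds."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# Two spatial blocks of three units each are recovered as two "areas".
blocks = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(blocks, k=2)
print(sorted(len(g) for g in groups))  # → [3, 3]
```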