The modularity function is a widely used measure of the quality of a graph clustering. Finding a clustering with maximal modularity is NP-hard, so only heuristic algorithms are capable of processing large datasets. An extensive literature on such heuristics has been published in recent years. We present a fast randomized greedy algorithm that uses only local information on gradients of the objective function. Furthermore, we present an approach that first identifies the 'cores' of clusters before calculating the final clustering. The global heuristic of identifying core groups solves problems associated with purely local approaches. With the presented algorithms we were able to calculate, for many real-world datasets, a clustering with a higher modularity than any previous algorithm.
{"title":"Cluster Cores and Modularity Maximization","authors":"Michael Ovelgönne, A. Geyer-Schulz","doi":"10.1109/ICDMW.2010.63","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.63","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
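To make the objective in the abstract above concrete, here is a minimal sketch of the standard Newman-Girvan modularity of a clustering, together with the modularity change obtained by merging two clusters. It is an illustration under stated assumptions (undirected, unweighted graph; the names `modularity` and `merge_gain` are hypothetical) and is not the authors' randomized greedy or core-groups algorithm.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Newman-Girvan modularity Q of a clustering.

    edges: list of (u, v) pairs, each undirected edge listed once.
    cluster_of: dict mapping every node to a cluster label.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the cluster
    degree_sum = defaultdict(int)  # sum of node degrees in the cluster
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over clusters of (fraction of internal edges - expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


def merge_gain(edges_between, degree_sum_i, degree_sum_j, m):
    """Change in Q when clusters i and j are merged.

    Only quantities local to the two clusters are needed, which is what
    makes greedy agglomeration cheap per step.
    """
    return edges_between / m - degree_sum_i * degree_sum_j / (2 * m * m)
```

A greedy agglomerative heuristic would repeatedly merge a pair of clusters with positive `merge_gain`; how the pair is chosen is where the individual heuristics differ.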
Mobile devices make it possible to track and collect trajectories of moving objects such as vehicles, people or animals. There exists a well-known collection of patterns that can occur for a subset of trajectories. Specifically, we study the so-called Popular Places, that is, regions that are visited by many distinct moving objects. We propose algorithms that efficiently compute different forms of reporting Popular Places and that take advantage of the parallelism of the Graphics Processing Unit. We also describe how to visualize the reported solutions. Finally, we present and discuss experimental results obtained with the implementation of our algorithms.
{"title":"Computing Popular Places Using Graphics Processors","authors":"Marta Fort, J. A. Sellarès, Nacho Valladares","doi":"10.1109/ICDMW.2010.45","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.45","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
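The notion of a popular place in the abstract above (a region visited by many distinct moving objects) can be illustrated with a small CPU-only sketch. The grid discretization, the threshold parameter `min_objects`, and the function name `popular_cells` are assumptions made for this example; the paper's GPU algorithms and its exact region definitions are not reproduced here.

```python
from collections import defaultdict


def popular_cells(trajectories, cell_size, min_objects):
    """Report grid cells visited by at least `min_objects` distinct objects.

    trajectories: dict mapping an object id to a list of (x, y) samples.
    cell_size: side length of the square grid cells (assumed discretization).
    Returns a dict {cell: number of distinct visitors}.
    """
    visitors = defaultdict(set)
    for obj_id, points in trajectories.items():
        for x, y in points:
            cell = (int(x // cell_size), int(y // cell_size))
            # A set ensures each object counts once per cell,
            # no matter how many of its samples fall inside it.
            visitors[cell].add(obj_id)
    return {cell: len(ids) for cell, ids in visitors.items()
            if len(ids) >= min_objects}
```

Counting distinct visitors rather than samples is the key point: an object that lingers in one cell does not make that cell popular by itself.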
In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. Detecting such patterns would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter plots of node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics to a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port scanning (in computer networks), repetitive tasks with botnet-like behavior, strange "bridges" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns; (b) it is fast (linear in the number of edges); (c) it is parameter-free; and (d) it is general and applicable to many diverse graphs, spanning tens of gigabytes.
{"title":"EigenDiagnostics: Spotting Connection Patterns and Outliers in Large Graphs","authors":"Koji Maruhashi, C. Faloutsos","doi":"10.1109/ICDMW.2010.203","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.203","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
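The abstract above mentions the Hough transform as the way linear-like patterns are found in scatter plots. The following is a minimal sketch of a textbook Hough transform over 2-D points, not the EigenDiagnostics implementation; the function name `hough_strongest_line` and the parameters `n_theta` and `rho_step` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def hough_strongest_line(points, n_theta=180, rho_step=1.0):
    """Find the line supported by the most points.

    Lines are parameterized in normal form: x*cos(theta) + y*sin(theta) = rho.
    Each point votes for every (theta, rho) bin whose line passes near it.
    Returns (theta in radians, rho, number of votes) of the fullest bin.
    """
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1
    (t, rho_bin), count = votes.most_common(1)[0]
    return math.pi * t / n_theta, rho_bin * rho_step, count
```

Applied to a scatter plot of, say, degree against an eigenscore, a bin with an unusually high vote count would indicate many nodes lying on one line. The bin resolution trades sensitivity for robustness to noise.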
Brett Gillick, Hasnain AlTaiar, S. Krishnaswamy, J. Liono, Nicholas Nicoloudis, Abhijat Sinha, A. Zaslavsky, M. Gaber
There is an emerging focus on real-time data stream analysis on mobile devices. While many mobile data stream mining algorithms have been developed in recent times, generic and scalable visualization techniques have not been presented. This paper demonstrates our clutter-adaptive cluster visualization technique for mobile devices. We have fully implemented this technique on the Google Android platform and provide demonstrations for different datasets: location (both real and synthetic) and stock market (real).
{"title":"Clutter-Adaptive Visualization for Mobile Data Mining","authors":"Brett Gillick, Hasnain AlTaiar, S. Krishnaswamy, J. Liono, Nicholas Nicoloudis, Abhijat Sinha, A. Zaslavsky, M. Gaber","doi":"10.1109/ICDMW.2010.134","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.134","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
By utilizing must-link or cannot-link pairwise constraints in the data, semi-supervised clustering significantly improves on the performance of unsupervised clustering. A number of semi-supervised clustering algorithms have been proposed to consider such pairwise constraints. However, most of them assign a hard label to each data item and produce little information about the cluster itself. In this work, we propose a semi-supervised algorithm for document clustering based on Probabilistic Latent Semantic Analysis (PLSA) that employs must-link supervision between two documents, which is available in many real-world datasets. The new algorithm can produce a soft cluster label assignment for each document as well as a probabilistic representation of the latent topics in each cluster. No additional parameters need to be estimated besides the parameters of standard PLSA. This reduces the risk of over-fitting, especially when the data is sparse. We provide the Expectation Maximization (EM) procedure for semi-supervised PLSA to determine the locally optimal parameters that maximize the likelihood. To utilize multiple computation nodes for large-scale datasets, we also propose a distributed implementation of the EM procedure based on the MapReduce framework. Experimental results on public datasets validate the effectiveness and efficiency of the new method.
{"title":"Semi-supervised PLSA for Document Clustering","authors":"Lingfeng Niu, Yong Shi","doi":"10.1109/ICDMW.2010.85","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.85","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
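For orientation, the EM procedure of standard, unsupervised PLSA, on which the abstract above builds, can be sketched as follows. This sketch does not include the must-link supervision or the MapReduce distribution described in the paper, whose details are not given in the abstract; the function name `plsa`, the random initialization, and the dense count matrix are assumptions for illustration. It also assumes that every document contains at least one word.

```python
import random


def plsa(counts, n_topics, iterations=50, seed=0):
    """Standard PLSA fitted with EM.

    counts: list of rows, counts[d][w] = occurrences of word w in document d.
    Returns (p_z_d, p_w_z): P(topic | document) and P(word | topic).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_distribution(k):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        return [v / total for v in row]

    p_z_d = [random_distribution(n_topics) for _ in range(n_docs)]
    p_w_z = [random_distribution(n_words) for _ in range(n_topics)]

    for _ in range(iterations):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w) is proportional to P(z|d) * P(w|z)
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(post)
                for z in range(n_topics):
                    resp = counts[d][w] * post[z] / total
                    new_zd[d][z] += resp
                    new_wz[z][w] += resp
        # M-step: renormalize the expected counts into distributions
        p_z_d = [[v / sum(row) for v in row] for row in new_zd]
        p_w_z = [[v / sum(row) for v in row] for row in new_wz]
    return p_z_d, p_w_z
```

Each row of `p_z_d` is the kind of soft cluster label the abstract refers to, and each row of `p_w_z` is a probabilistic description of one latent topic. EM only reaches a local optimum, so results depend on the initialization.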
Online dating networks, a type of social network, are gaining popularity. With many people joining and being available in the network, users are overwhelmed with choices when choosing their ideal partners. This problem can be overcome by utilizing recommendation methods. However, traditional recommendation methods are ineffective and inefficient for online dating networks, where the dataset is sparse and/or large and two-way matching is required. We propose a methodology that uses clustering and SimRank to recommend matching candidates to users in an online dating network. Data from a live online dating network is used in the evaluation. The recommendation success rate obtained with the proposed method is compared with the baseline success rate of the network, and the performance is doubled.
{"title":"Improving Matching Process in Social Network","authors":"Lin Chen, R. Nayak, Yue Xu","doi":"10.1109/ICDMW.2010.41","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.41","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
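SimRank, named in the abstract above, scores two nodes as similar when they are referenced by similar nodes. The sketch below is the basic iterative SimRank computation on a small directed graph. How the paper combines it with clustering and two-way matching is not described in the abstract and is not attempted here; the function name `simrank` and the default `decay` value are assumptions.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic iterative SimRank.

    in_neighbors: dict mapping every node to the list of nodes pointing to it
                  (every referenced node must itself be a key).
    Returns a nested dict sim[a][b] with similarity scores in [0, 1].
    """
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # Nodes without in-neighbors have no evidence of similarity.
                    new[a][b] = 0.0
                    continue
                total = sum(sim[x][y] for x in ia for y in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim
```

This direct form costs time quadratic in the number of nodes per iteration, which is one reason a preliminary clustering step can help on large networks.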
Junming Shao, K. Hahn, Qinli Yang, C. Böhm, A. Wohlschläger, Nicholas Myers, C. Plant
Understanding the connectome of the human brain is a major challenge in neuroscience. Discovering the wiring and the major cables of the brain is essential for a better understanding of brain function. Diffusion Tensor Imaging (DTI) provides a potential way of exploring the organization of white matter fiber tracts in human subjects non-invasively. However, it is a long way from the approximately one million voxels of a raw DT image to usable knowledge. After preprocessing, including registration and motion correction, fiber tracking approaches extract thousands of fibers from diffusion-weighted images. In this paper, we focus on the question of how we can identify meaningful groups of fiber tracts that represent the major cables of the brain. We combine ideas from time series mining with density-based clustering into a novel framework for effective and efficient fiber clustering. We first introduce a novel fiber similarity measure based on dynamic time warping. This fiber warping measure successfully captures local similarity among fibers belonging to a common bundle but having different start and end points. A lower bound on this fiber warping measure speeds up computation. The result of fiber tracking often contains imperfect fibers and outliers. Therefore, we combine fiber warping with an outlier-robust density-based clustering algorithm. Extensive experiments on synthetic and real data demonstrate the effectiveness and efficiency of our approach.
{"title":"Combining Time Series Similarity with Density-Based Clustering to Identify Fiber Bundles in the Human Brain","authors":"Junming Shao, K. Hahn, Qinli Yang, C. Böhm, A. Wohlschläger, Nicholas Myers, C. Plant","doi":"10.1109/ICDMW.2010.15","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.15","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
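The fiber similarity measure in the abstract above is based on dynamic time warping (DTW). As a point of reference, classic DTW between two fibers, each treated as a sequence of 3-D points, can be sketched as below. This is plain DTW, not the paper's fiber warping measure or its lower bound, whose definitions are not given in the abstract; the function name `dtw_distance` is an assumption.

```python
import math


def dtw_distance(fiber_a, fiber_b):
    """Classic dynamic time warping distance between two point sequences.

    fiber_a, fiber_b: non-empty lists of (x, y, z) tuples.
    The warping path may repeat points of either sequence, so fibers sampled
    at different rates can still be aligned.
    """
    n, m = len(fiber_a), len(fiber_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i and first j points
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(fiber_a[i - 1], fiber_b[j - 1])  # Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # repeat point of fiber_b
                                 cost[i][j - 1],      # repeat point of fiber_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

A pairwise distance of this kind can be handed to a density-based clustering algorithm, which needs only distances between objects rather than vector coordinates.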
Collecting, monitoring, and analyzing data automatically by well-instrumented systems is frequently motivated by human decision-making. However, the same need occurs when system software decisions are to be justified. Compiler optimization or storage management requires several decisions that result in more or less resource consumption, be it energy, memory, or runtime. A large amount of system data can be collected in order to base decisions of compilers or the operating system on empirical analysis. The challenge of large-scale data is aggravated if system data of small and often mobile systems are collected and analyzed. In contrast to the large data volume, mobile devices offer only very limited storage and computing capacity. Moreover, if analysis results are put to use in the operating system, the real-time response is at the system level, not at the level of human reaction time. In this paper, small and most often mobile systems (i.e., ubiquitous systems) are instrumented for the collection of system call data. We investigate whether the sequence and the structure of system calls should be taken into account by the learning method. A structural learning method, Conditional Random Fields (CRF), is applied using different internal optimization algorithms and feature mappings. Implementing CRF in a massively parallel way using general-purpose graphics processing units (GPGPU) points toward future ubiquitous systems.
{"title":"Enhancing Ubiquitous Systems through System Call Mining","authors":"K. Morik, F. Jungermann, N. Piatkowski, M. Engel","doi":"10.1109/ICDMW.2010.133","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.133","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
Social bookmarking tools are rapidly emerging on the Web, as witnessed by the overwhelming number of participants. In such spaces, users annotate resources by means of any keyword or tag that they find relevant, giving rise to lightweight conceptual structures, also known as folksonomies. In this respect, ontologies can be of benefit for enhancing information retrieval metrics. In this paper, we introduce a novel approach for ontology learning from a folksonomy, which provides shared vocabularies and semantic relations between tags. The main thrust of the introduced approach is its focus on the discovery of non-taxonomic relationships. The latter are often neglected, even though they are of paramount importance from a semantic point of view. The discovery process relies heavily on triadic concepts to discover and select related tags and to extract and label non-taxonomic relationships between related tags, and on external sources for tag filtering and non-taxonomic relationship extraction. In addition, we discuss a new approach to automatically evaluate the obtained relations against the WordNet repository and present promising results for a real-world folksonomy.
{"title":"Bridging Folksonomies and Domain Ontologies: Getting Out Non-taxonomic Relations","authors":"C. Trabelsi, A. Jrad, S. Yahia","doi":"10.1109/ICDMW.2010.72","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.72","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}
Census operations are very important events in the history of a nation. These operations cover every bit of land and property of the country and its citizens. The publication of census data based on spatial units is one of the important problems of national statistical organizations, and it requires the determination of small statistical areas (SSAs), the so-called census geography. Since 2006, Turkey has aimed to produce census data not as “de-facto” (static) but as “de-jure” (real-time) through the new Address Based Register Information System (ABPRS). In addition, with this new register-based census, personal information is matched with address information, and censuses have gained a spatial dimension. However, as Turkey lacks SSAs, the data cannot be published at smaller spatial granularities. This study employs a spatial clustering and districting methodology to automatically produce SSAs that are built upon ABPRS data geo-referenced with the aid of geographical information systems (GIS). To this end, simulated annealing on k-means clustering of Self-Organizing Map (SOM) unified distances is employed to produce SSAs for ABPRS. The method is implemented on block datasets having either raw census data or socio-economic status (SES) indices obtained from census data. The resulting SSAs are evaluated for the case study area.
{"title":"Using Self-Organizing Map and Heuristics to Identify Small Statistical Areas Based on Household Socio-Economic Indicators in Turkey's Address Based Population Register System","authors":"H. Düzgün, Seyma Ozcan Yavuzoglu","doi":"10.1109/ICDMW.2010.104","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.104","journal":{"name":"2010 IEEE International Conference on Data Mining Workshops"},"publicationDate":"2010-12-13"}