In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. Detecting such patterns would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter plots of node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics to a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port-scanning (in computer networks), repetitive tasks with botnet-like behavior, strange "bridges" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns; (b) it is fast (linear in the number of edges); (c) it is parameter-free; and (d) it is general, applicable to many diverse graphs spanning tens of gigabytes.
{"title":"EigenDiagnostics: Spotting Connection Patterns and Outliers in Large Graphs","authors":"Koji Maruhashi, C. Faloutsos","doi":"10.1109/ICDMW.2010.203","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.203","url":null,"abstract":"In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. Detecting such patterns would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter plots of node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics to a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port-scanning (in computer networks), repetitive tasks with botnet-like behavior, strange \"bridges\" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns; (b) it is fast (linear in the number of edges); (c) it is parameter-free; and (d) it is general, applicable to many diverse graphs spanning tens of gigabytes.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128620341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
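The key step of the abstract above, spotting linear-like patterns in node-property scatter plots via the Hough transform, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the accumulator resolution (`n_theta`, `rho_step`) and the toy point set are our own assumptions.

```python
import math
from collections import Counter

def hough_line_votes(points, n_theta=180, rho_step=1.0):
    """Each point (x, y) votes for every line rho = x*cos(theta) + y*sin(theta)
    passing through it; collinear points pile their votes into one (theta, rho)
    bin of the accumulator, so a strong bin flags a linear-like pattern."""
    acc = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[(t, round(rho / rho_step))] += 1
    return acc

# 20 points on the line y = x plus one off-line outlier: the winning bin
# collects exactly the 20 collinear votes.
pts = [(i, i) for i in range(20)] + [(3, 15)]
best_bin, votes = max(hough_line_votes(pts).items(), key=lambda kv: kv[1])
print(votes)  # → 20
```

In the paper's setting the (x, y) pairs would be node properties such as degree versus eigenscore, and the peak bins identify the suspicious node groups.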
The modularity function is a widely used measure of the quality of a graph clustering. Finding a clustering with maximal modularity is NP-hard, so only heuristic algorithms are capable of processing large datasets. Extensive literature on such heuristics has been published in recent years. We present a fast randomized greedy algorithm that uses solely local information on gradients of the objective function. Furthermore, we present an approach that first identifies the 'cores' of clusters before calculating the final clustering. The global heuristic of identifying core groups solves problems associated with purely local approaches. With the presented algorithms, we were able to calculate, for many real-world datasets, a clustering with higher modularity than any previous algorithm.
{"title":"Cluster Cores and Modularity Maximization","authors":"Michael Ovelgönne, A. Geyer-Schulz","doi":"10.1109/ICDMW.2010.63","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.63","url":null,"abstract":"The modularity function is a widely used measure of the quality of a graph clustering. Finding a clustering with maximal modularity is NP-hard, so only heuristic algorithms are capable of processing large datasets. Extensive literature on such heuristics has been published in recent years. We present a fast randomized greedy algorithm that uses solely local information on gradients of the objective function. Furthermore, we present an approach that first identifies the 'cores' of clusters before calculating the final clustering. The global heuristic of identifying core groups solves problems associated with purely local approaches. With the presented algorithms, we were able to calculate, for many real-world datasets, a clustering with higher modularity than any previous algorithm.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128134219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
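For readers unfamiliar with the objective being maximized above, here is a minimal sketch of Newman-Girvan modularity for an unweighted, undirected graph. The paper's randomized greedy and core-group algorithms optimize this quantity; the code below only evaluates it, and the two-triangle toy graph is our own illustration.

```python
def modularity(edges, community):
    """Modularity Q = sum over communities c of (e_c / m - (d_c / 2m)^2),
    where m is the edge count, e_c the number of intra-community edges,
    and d_c the total degree of community c. `community` maps node -> label."""
    m = len(edges)
    deg, intra = {}, {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    tot = {}  # total degree per community
    for node, d in deg.items():
        tot[community[node]] = tot.get(community[node], 0) + d
    return sum(intra.get(c, 0) / m - (tot[c] / (2 * m)) ** 2 for c in tot)

# Two triangles joined by one edge: the natural two-way split scores
# Q ≈ 0.357, while lumping everything into one cluster scores 0.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
lump = {n: 'x' for n in range(6)}
print(round(modularity(edges, split), 3), modularity(edges, lump))  # → 0.357 0.0
```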
Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, Jianping Fan
The rapid growth of data has provided us with more information, yet challenges traditional techniques to extract useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor (KNN) method and cluster-based classification, aiming to overcome their disadvantages and cope with large amounts of data. Our model is implemented on the Hadoop platform, using its MapReduce programming framework, which is particularly suitable for cloud computing. We have run experiments on several datasets, including real-world data from the UCI repository and synthetic data, on a Downing 4000 cluster with Hadoop installed. The results show that our model generally outperforms KNN and some other classification methods with respect to accuracy and scalability.
{"title":"Minimum Spanning Tree Based Classification Model for Massive Data with MapReduce Implementation","authors":"Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, Jianping Fan","doi":"10.1109/ICDMW.2010.14","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.14","url":null,"abstract":"The rapid growth of data has provided us with more information, yet challenges traditional techniques to extract useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor (KNN) method and cluster-based classification, aiming to overcome their disadvantages and cope with large amounts of data. Our model is implemented on the Hadoop platform, using its MapReduce programming framework, which is particularly suitable for cloud computing. We have run experiments on several datasets, including real-world data from the UCI repository and synthetic data, on a Downing 4000 cluster with Hadoop installed. The results show that our model generally outperforms KNN and some other classification methods with respect to accuracy and scalability.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128723822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
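The MST ingredient of MCMM can be illustrated on a single machine with Prim's algorithm; the cut-the-heaviest-edge step and the toy points below are our own illustrative assumptions (the paper distributes this computation with MapReduce rather than running it serially).

```python
def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph over `points`;
    returns a list of MST edges as (u, v, weight) index triples."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # cheapest edge leaving the current tree
        u, v = min(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
        in_tree.add(v)
        edges.append((u, v, dist(points[u], points[v])))
    return edges

# Two well-separated groups: the heaviest MST edge is the single "bridge"
# between them, so cutting it recovers the two clusters.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mst = prim_mst(pts)
bridge = max(mst, key=lambda e: e[2])
print(len(mst), (bridge[0] < 3) != (bridge[1] < 3))  # → 5 True
```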
Due to the popularity of nonnegative matrix factorization and the increasing availability of massive data sets, researchers face the problem of factorizing large-scale matrices with dimensions on the order of millions. Recent research [11] has shown that it is feasible to factorize a million-by-million matrix with billions of nonzero elements on a MapReduce cluster. In this work, we present three different matrix multiplication implementations and scale up three types of nonnegative matrix factorizations on MapReduce. Experiments on both synthetic and real-world datasets show the excellent scalability of our proposed algorithms.
{"title":"Large-Scale Matrix Factorization Using MapReduce","authors":"Zhengguo Sun, Tao Li, N. Rishe","doi":"10.1109/ICDMW.2010.155","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.155","url":null,"abstract":"Due to the popularity of nonnegative matrix factorization and the increasing availability of massive data sets, researchers face the problem of factorizing large-scale matrices with dimensions on the order of millions. Recent research [11] has shown that it is feasible to factorize a million-by-million matrix with billions of nonzero elements on a MapReduce cluster. In this work, we present three different matrix multiplication implementations and scale up three types of nonnegative matrix factorizations on MapReduce. Experiments on both synthetic and real-world datasets show the excellent scalability of our proposed algorithms.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125026132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
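As a single-machine reference point for the factorizations being scaled, here is a sketch of the standard Lee-Seung multiplicative updates for the Frobenius objective. The updates consist only of matrix products, which is what makes them amenable to MapReduce-style scaling; the toy matrix, iteration count, and initialization are our own illustrative choices, not the paper's.

```python
import numpy as np

def nmf(V, r, iters=500, seed=0):
    """Multiplicative updates for V ≈ W @ H with W, H >= 0 (Frobenius loss).
    Every step is a chain of matrix products, each of which can be
    expressed as a MapReduce job for large V."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 0.1
    H = rng.random((r, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

V = np.array([[1., 0., 2.], [2., 0., 4.], [0., 3., 0.]])  # nonnegative, rank 2
W, H = nmf(V, r=2)
print(float(np.abs(V - W @ H).max()) < 0.05)  # reconstruction is near-exact
```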
Continuously increasing amounts of data in data warehouses are providing companies with ample opportunity to conduct analytical customer relationship management (CRM). However, how to utilize the information retrieved from the analysis of these data to retain the most valuable customers, identify customers with additional revenue potential, and achieve cost-effective customer relationship management continues to pose challenges for companies. This study proposes a two-level approach combining SOM-Ward clustering and predictive analytics to segment the customer base of a case company with 1.5 million customers. First, according to the customers' spending amounts and demographic and behavioral characteristics, we adopt SOM-Ward clustering to segment the customer base into seven segments: exclusive customers, high-spending customers, and five segments of mass customers. Then, three classification models - the support vector machine (SVM), the neural network, and the decision tree - are employed to classify high-spending and low-spending customers. The performance of the three classification models is evaluated and compared, and the three models are then combined to predict potential high-spending customers among the mass customers. We find that this hybrid approach can provide more thorough and detailed information about the customer base, especially the untapped mass market with potentially high revenue contribution, for tailoring actionable marketing strategies.
{"title":"Using SOM-Ward Clustering and Predictive Analytics for Conducting Customer Segmentation","authors":"Zhiyuan Yao, T. Eklund, B. Back","doi":"10.1109/ICDMW.2010.121","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.121","url":null,"abstract":"Continuously increasing amounts of data in data warehouses are providing companies with ample opportunity to conduct analytical customer relationship management (CRM). However, how to utilize the information retrieved from the analysis of these data to retain the most valuable customers, identify customers with additional revenue potential, and achieve cost-effective customer relationship management continues to pose challenges for companies. This study proposes a two-level approach combining SOM-Ward clustering and predictive analytics to segment the customer base of a case company with 1.5 million customers. First, according to the customers' spending amounts and demographic and behavioral characteristics, we adopt SOM-Ward clustering to segment the customer base into seven segments: exclusive customers, high-spending customers, and five segments of mass customers. Then, three classification models - the support vector machine (SVM), the neural network, and the decision tree - are employed to classify high-spending and low-spending customers. The performance of the three classification models is evaluated and compared, and the three models are then combined to predict potential high-spending customers among the mass customers. We find that this hybrid approach can provide more thorough and detailed information about the customer base, especially the untapped mass market with potentially high revenue contribution, for tailoring actionable marketing strategies.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131402435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online dating networks, a type of social network, are gaining popularity. With many people joining and being available in the network, users are overwhelmed with choices when choosing their ideal partners. This problem can be overcome by utilizing recommendation methods. However, traditional recommendation methods are ineffective and inefficient for online dating networks, where the dataset is sparse and/or large and two-way matching is required. We propose a methodology that uses clustering and SimRank to recommend matching candidates to users in an online dating network. Data from a live online dating network is used in the evaluation. The success rate of recommendations obtained using the proposed method is compared with the baseline success rate of the network, and the proposed method doubles the success rate.
{"title":"Improving Matching Process in Social Network","authors":"Lin Chen, R. Nayak, Yue Xu","doi":"10.1109/ICDMW.2010.41","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.41","url":null,"abstract":"Online dating networks, a type of social network, are gaining popularity. With many people joining and being available in the network, users are overwhelmed with choices when choosing their ideal partners. This problem can be overcome by utilizing recommendation methods. However, traditional recommendation methods are ineffective and inefficient for online dating networks, where the dataset is sparse and/or large and two-way matching is required. We propose a methodology that uses clustering and SimRank to recommend matching candidates to users in an online dating network. Data from a live online dating network is used in the evaluation. The success rate of recommendations obtained using the proposed method is compared with the baseline success rate of the network, and the proposed method doubles the success rate.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121362516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
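The SimRank component of the methodology above can be sketched with the naive iterative definition. The bipartite toy graph (users connected to the profiles they contacted) is our own illustrative assumption; the paper combines SimRank with clustering on real dating-network data rather than running this naive O(n^2) version.

```python
def simrank(in_nbrs, C=0.8, iters=10):
    """Naive SimRank: s(a, a) = 1 and
    s(a, b) = C / (|I(a)| * |I(b)|) * sum over in-neighbor pairs of s(i, j),
    with s(a, b) = 0 when either in-neighborhood is empty.
    `in_nbrs` maps node -> list of in-neighbors."""
    nodes = list(in_nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    s = sum(sim[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                    new[(a, b)] = C * s / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# Profiles p and q share the in-neighbor u1, so they come out similar.
graph = {'u1': [], 'u2': [], 'p': ['u1', 'u2'], 'q': ['u1']}
sim = simrank(graph)
print(round(sim[('p', 'q')], 3))  # → 0.4
```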
Junming Shao, K. Hahn, Qinli Yang, C. Böhm, A. Wohlschläger, Nicholas Myers, C. Plant
Understanding the connectome of the human brain is a major challenge in neuroscience. Discovering the wiring and the major cables of the brain is essential for a better understanding of brain function. Diffusion Tensor Imaging (DTI) provides a non-invasive way of exploring the organization of white matter fiber tracts in human subjects. However, it is a long way from the approximately one million voxels of a raw DT image to utilizable knowledge. After preprocessing, including registration and motion correction, fiber tracking approaches extract thousands of fibers from diffusion-weighted images. In this paper, we focus on the question of how to identify meaningful groups of fiber tracts that represent the major cables of the brain. We combine ideas from time series mining with density-based clustering into a novel framework for effective and efficient fiber clustering. We first introduce a novel fiber similarity measure based on dynamic time warping. This fiber warping measure successfully captures local similarity among fibers that belong to a common bundle but have different start and end points, and a lower bound on the measure speeds up computation. Because the result of fiber tracking often contains imperfect fibers and outliers, we combine fiber warping with an outlier-robust density-based clustering algorithm. Extensive experiments on synthetic and real data demonstrate the effectiveness and efficiency of our approach.
{"title":"Combining Time Series Similarity with Density-Based Clustering to Identify Fiber Bundles in the Human Brain","authors":"Junming Shao, K. Hahn, Qinli Yang, C. Böhm, A. Wohlschläger, Nicholas Myers, C. Plant","doi":"10.1109/ICDMW.2010.15","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.15","url":null,"abstract":"Understanding the connectome of the human brain is a major challenge in neuroscience. Discovering the wiring and the major cables of the brain is essential for a better understanding of brain function. Diffusion Tensor Imaging (DTI) provides a non-invasive way of exploring the organization of white matter fiber tracts in human subjects. However, it is a long way from the approximately one million voxels of a raw DT image to utilizable knowledge. After preprocessing, including registration and motion correction, fiber tracking approaches extract thousands of fibers from diffusion-weighted images. In this paper, we focus on the question of how to identify meaningful groups of fiber tracts that represent the major cables of the brain. We combine ideas from time series mining with density-based clustering into a novel framework for effective and efficient fiber clustering. We first introduce a novel fiber similarity measure based on dynamic time warping. This fiber warping measure successfully captures local similarity among fibers that belong to a common bundle but have different start and end points, and a lower bound on the measure speeds up computation. Because the result of fiber tracking often contains imperfect fibers and outliers, we combine fiber warping with an outlier-robust density-based clustering algorithm. Extensive experiments on synthetic and real data demonstrate the effectiveness and efficiency of our approach.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116911104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
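The dynamic time warping backbone of the fiber-warping measure is the classic textbook recurrence. A minimal scalar-sequence version looks like this; in the paper the sequences are 3D fiber polylines and a lower bound prunes the computation, both of which are omitted here.

```python
def dtw(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping with |a - b| step cost:
    D[i][j] = cost(i, j) + min(D[i-1][j], D[i][j-1], D[i-1][j-1])."""
    INF = float('inf')
    n, m = len(x), len(y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(D[i - 1][j], D[i][j - 1],
                                                     D[i - 1][j - 1])
    return D[n][m]

a = [0, 1, 2, 3, 2, 1]
b = [0, 0, 1, 2, 3, 2, 1]   # same shape, shifted in time
c = [5, 5, 5, 5, 5, 5]      # genuinely different sequence
print(dtw(a, b), dtw(a, c))  # → 0.0 21.0
```

Warping absorbs the time shift between `a` and `b` entirely, which is exactly the property that lets the fiber measure match bundle-mates with different start and end points.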
Collecting, monitoring, and analyzing data automatically with well-instrumented systems is frequently motivated by human decision-making. However, the same need occurs when system software decisions are to be justified. Compiler optimization and storage management require several decisions that result in more or less resource consumption, be it energy, memory, or runtime. A multitude of system data can be collected in order to base decisions of compilers or the operating system on empirical analysis. The challenge of large-scale data is aggravated when system data from small and often mobile systems are collected and analyzed. In contrast to the large data volume, mobile devices offer only very limited storage and computing capacity. Moreover, if analysis results are put to use in the operating system, the real-time response must be at the system level, not at the level of human reaction time. In this paper, small and most often mobile systems (i.e., ubiquitous systems) are instrumented for the collection of system call data. We investigate whether the sequence and structure of system calls need to be taken into account by the learning method. A structural learning method, Conditional Random Fields (CRF), is applied using different internal optimization algorithms and feature mappings. Implementing CRF in a massively parallel way using general-purpose graphics processing units (GPGPUs) points toward future ubiquitous systems.
{"title":"Enhancing Ubiquitous Systems through System Call Mining","authors":"K. Morik, F. Jungermann, N. Piatkowski, M. Engel","doi":"10.1109/ICDMW.2010.133","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.133","url":null,"abstract":"Collecting, monitoring, and analyzing data automatically with well-instrumented systems is frequently motivated by human decision-making. However, the same need occurs when system software decisions are to be justified. Compiler optimization and storage management require several decisions that result in more or less resource consumption, be it energy, memory, or runtime. A multitude of system data can be collected in order to base decisions of compilers or the operating system on empirical analysis. The challenge of large-scale data is aggravated when system data from small and often mobile systems are collected and analyzed. In contrast to the large data volume, mobile devices offer only very limited storage and computing capacity. Moreover, if analysis results are put to use in the operating system, the real-time response must be at the system level, not at the level of human reaction time. In this paper, small and most often mobile systems (i.e., ubiquitous systems) are instrumented for the collection of system call data. We investigate whether the sequence and structure of system calls need to be taken into account by the learning method. A structural learning method, Conditional Random Fields (CRF), is applied using different internal optimization algorithms and feature mappings. Implementing CRF in a massively parallel way using general-purpose graphics processing units (GPGPUs) points toward future ubiquitous systems.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115258582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social bookmarking tools are rapidly emerging on the Web, as witnessed by the overwhelming number of participants. In such spaces, users annotate resources with any keywords or tags that they find relevant, giving rise to lightweight conceptual structures known as folksonomies. In this respect, ontologies can be of benefit for enhancing information retrieval. In this paper, we introduce a novel approach for ontology learning from a folksonomy, which provides shared vocabularies and semantic relations between tags. The main thrust of the introduced approach lies in focusing on the discovery of non-taxonomic relationships, which are often neglected even though they are of paramount importance from a semantic point of view. The discovery process relies heavily on triadic concepts to discover and select related tags, and on external sources for tag filtering and for extracting and labeling non-taxonomic relationships between related tags. In addition, we discuss a new approach for automatically evaluating the obtained relations against the WordNet repository, and we present promising results for a real-world folksonomy.
{"title":"Bridging Folksonomies and Domain Ontologies: Getting Out Non-taxonomic Relations","authors":"C. Trabelsi, A. Jrad, S. Yahia","doi":"10.1109/ICDMW.2010.72","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.72","url":null,"abstract":"Social bookmarking tools are rapidly emerging on the Web, as witnessed by the overwhelming number of participants. In such spaces, users annotate resources with any keywords or tags that they find relevant, giving rise to lightweight conceptual structures known as folksonomies. In this respect, ontologies can be of benefit for enhancing information retrieval. In this paper, we introduce a novel approach for ontology learning from a folksonomy, which provides shared vocabularies and semantic relations between tags. The main thrust of the introduced approach lies in focusing on the discovery of non-taxonomic relationships, which are often neglected even though they are of paramount importance from a semantic point of view. The discovery process relies heavily on triadic concepts to discover and select related tags, and on external sources for tag filtering and for extracting and labeling non-taxonomic relationships between related tags. In addition, we discuss a new approach for automatically evaluating the obtained relations against the WordNet repository, and we present promising results for a real-world folksonomy.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114463407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Census operations are very important events in the history of a nation. These operations cover every bit of land and property of the country and its citizens. The publication of census data based on spatial units is one of the important problems of national statistical organizations, and it requires the determination of small statistical areas (SSAs), the so-called census geography. Since 2006, Turkey has aimed to produce census data not on a “de facto” (static) basis but on a “de jure” (real-time) basis through the new Address Based Population Register System (ABPRS). Moreover, in this new register-based census, personal information is matched with address information, so censuses have gained a spatial dimension. However, as Turkey lacks SSAs, the data cannot be published at smaller spatial granularities. This study employs a spatial clustering and districting methodology to automatically produce SSAs built upon the ABPRS data, geo-referenced with the aid of geographical information systems (GIS). For its realization, simulated annealing on k-means clustering of Self-Organizing Map (SOM) unified distances is employed to produce SSAs for the ABPRS. The method is implemented on block datasets containing either raw census data or socio-economic status (SES) indices obtained from census data. The resulting SSAs are evaluated for the case study area.
{"title":"Using Self-Organizing Map and Heuristics to Identify Small Statistical Areas Based on Household Socio-Economic Indicators in Turkey's Address Based Population Register System","authors":"H. Düzgün, Seyma Ozcan Yavuzoglu","doi":"10.1109/ICDMW.2010.104","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.104","url":null,"abstract":"Census operations are very important events in the history of a nation. These operations cover every bit of land and property of the country and its citizens. The publication of census data based on spatial units is one of the important problems of national statistical organizations, and it requires the determination of small statistical areas (SSAs), the so-called census geography. Since 2006, Turkey has aimed to produce census data not on a “de facto” (static) basis but on a “de jure” (real-time) basis through the new Address Based Population Register System (ABPRS). Moreover, in this new register-based census, personal information is matched with address information, so censuses have gained a spatial dimension. However, as Turkey lacks SSAs, the data cannot be published at smaller spatial granularities. This study employs a spatial clustering and districting methodology to automatically produce SSAs built upon the ABPRS data, geo-referenced with the aid of geographical information systems (GIS). For its realization, simulated annealing on k-means clustering of Self-Organizing Map (SOM) unified distances is employed to produce SSAs for the ABPRS. The method is implemented on block datasets containing either raw census data or socio-economic status (SES) indices obtained from census data. The resulting SSAs are evaluated for the case study area.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124739470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
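The clustering core of the SSA pipeline can be sketched with plain Lloyd k-means. The SOM unified-distance input and the simulated-annealing refinement described in the abstract are omitted, and the toy 2D points are an illustrative stand-in for block-level socio-economic indicators.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd k-means on tuples: alternate nearest-center assignment
    and center recomputation for a fixed number of rounds."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# Two spatial blocks of three units each are recovered as two "areas".
blocks = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(blocks, k=2)
print(sorted(len(g) for g in groups))  # → [3, 3]
```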