In this paper we consider the problem of discovering frequent temporal patterns in a database of temporal sequences, where a temporal sequence is a set of items with associated dates and durations. Because quantitative temporal information appears to be fundamental in many contexts, it is taken into account during mining and returned as part of the extracted knowledge. To this end, we have adapted the classical Apriori framework (Agrawal and Srikant, 1995) to propose an efficient algorithm based on a hypercube representation of temporal sequences. Quantitative temporal information is extracted by estimating the density of the distribution of event intervals in the temporal sequences. An evaluation on synthetic data sets shows that the proposed algorithm can robustly extract frequent temporal patterns with quantitative temporal extents.
{"title":"Mining Temporal Patterns with Quantitative Intervals","authors":"Thomas Guyet, R. Quiniou","doi":"10.1109/ICDMW.2008.16","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.16","url":null,"abstract":"In this paper we consider the problem of discovering frequent temporal patterns in a database of temporal sequences, where a temporal sequence is a set of items with associated dates and durations. Since the quantitative temporal information appears to be fundamental in many contexts, it is taken into account in the mining processes and returned as part of the extracted knowledge. To this end, we have adapted the classical a priori (Agrawal and Srikant, 1995) framework to propose an efficient algorithm based on a hyper-cube representation of temporal sequences. The extraction of quantitative temporal information is performed using a density estimation of the distribution of event intervals from the temporal sequences. An evaluation on synthetic data sets shows that the proposed algorithm can robustly extract frequent temporal patterns with quantitative temporal extents.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130914839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weka4WS is an extension of the Weka toolkit that supports remote execution of data mining tasks as grid services. A first version of Weka4WS, supporting concurrent execution of multiple data mining tasks on remote grid nodes, was presented in previous work. In this paper we present a new version that also supports the composition and execution of data mining workflows on a grid. This new version of Weka4WS extends the KnowledgeFlow component of Weka by allowing the data mining tasks of a workflow to run in parallel on different machines, thereby reducing the execution time. Besides the performance improvement, the ability to design data mining applications as workflows makes it possible to define typical patterns and to reuse them in different contexts. We describe the architecture of the system, the functionalities of the Weka4WS KnowledgeFlow, and some examples of use with their performance.
{"title":"Service Oriented KDD: A Framework for Grid Data Mining Workflows","authors":"M. Lackovic, D. Talia, Paolo Trunfio","doi":"10.1109/ICDMW.2008.28","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.28","url":null,"abstract":"Weka4WS is an extension of the Weka toolkit to support remote execution of data mining tasks as grid services. A first version of Weka4WS supporting concurrent execution of multiple data mining tasks on remote grid nodes has been presented in a previous work. In this paper we present a new version supporting also the composition and execution of data mining workflows on a grid. This new version of Weka4WS extends the KnowledgeFlow component of Weka by allowing the data mining tasks of the workflow to run in parallel on different machines, hence reducing the execution time. Besides the performance improvement, the capability of designing data mining applications as workflows allows to define typical patterns and to reuse them in different contexts. In this paper we describe the architecture of the system, the functionalities of the Weka4WS KnowledgeFlow, and some examples of use with their performance.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114512296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transductive learning is the learning setting that makes it possible to learn from "particular to particular" and to consider both labelled and unlabelled examples when taking classification decisions. In this paper, we investigate the use of transductive learning in the context of hierarchical text categorization. To this end, we exploit a modified version of an inductive hierarchical learning framework that can classify documents into both internal and leaf nodes of a hierarchy of categories. Experimental results on real-world datasets are reported.
{"title":"Hierarchical Text Categorization in a Transductive Setting","authors":"Michelangelo Ceci","doi":"10.1109/ICDMW.2008.126","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.126","url":null,"abstract":"Transductive learning is the learning setting that permits to learn from \"particular to particular'' and to consider both labelled and unlabelled examples when taking classification decisions. In this paper, we investigate the use of transductive learning in the context of hierarchical text categorization. At this aim, we exploit a modified version of an inductive hierarchical learning framework that permits to classify documents in internal and leaf nodes of a hierarchy of categories. Experimental results on real world datasets are reported.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115455015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a support system for composing good titles for research papers in order to reach new audiences. Our system takes titles as input, evaluates the understandability and interest level of each title, ranks the titles, and outputs a title list. Users are able to recompose their titles by referring to the list and to each evaluation value. Using the system, users can reach new audiences who have not previously been interested in the user's research area. Experimental results showed that our system is able to rank titles in descending order of audiences' choices.
{"title":"Title-Composing Support System for Reaching New Audiences","authors":"Yoko Nishihara, W. Sunayama","doi":"10.1109/ICDMW.2008.24","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.24","url":null,"abstract":"This paper proposes a support system for composing good titles for research papers in order to reach new audiences. Our system takes titles as input. The system evaluates title understandability and interest level of a title. The system ranks titles and outputs a title list. Users are able to recompose their titles by referring to the list and each evaluation value. Using the system, users can obtain new audiences who have not previously been interested in the user's research area. Experimental results showed that our system is able to rank titles in descending order of audiences' choices.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"283 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115634224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An association rule (AR) is a common knowledge model in data mining that describes an implicative co-occurring relationship between two disjoint sets of binary-valued transaction database attributes (items), expressed in the form of an "antecedent ⇒ consequent" rule. A variant of the AR is the weighted association rule (WAR). With regard to a marketing context, this paper introduces a new knowledge model in data mining: the allocating pattern (ALP). An ALP is a special form of WAR, where each rule item is associated with a weighting score between 0 and 1, and the sum of all rule item scores is 1. It not only indicates the implicative co-occurring relationship between two (disjoint) sets of items in a weighted setting, but also conveys the "allocating" relationship among rule items. ALPs can be shown to be applicable in marketing and possibly in a surprising variety of other areas. We further propose an Apriori-based algorithm to extract hidden and interesting ALPs from a "one-sum" weighted transaction database. The experimental results show the effectiveness of the proposed algorithm.
{"title":"Mining Allocating Patterns in One-Sum Weighted Items","authors":"Y. Wang, Xinwei Zheng, Frans Coenen, Cindy Y. Li","doi":"10.1109/ICDMW.2008.112","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.112","url":null,"abstract":"An association rule (AR) is a common knowledge model in data mining that describes an implicative co-occurring relationship between two disjoint sets of binary-valued transaction database attributes (items), expressed in the form of an \"antecedent ⇒ consequent\" rule. A variant of the AR is the weighted association rule (WAR). With regard to a marketing context, this paper introduces a new knowledge model in data mining - allocating pattern (ALP). An ALP is a special form of WAR, where each rule item is associated with a weighting score between 0 and 1, and the sum of all rule item scores is 1. It can not only indicate the implicative co-occurring relationship between two (disjoint) sets of items in a weighted setting, but also inform the \"allocating\" relationship among rule items. ALPs can be demonstrated to be applicable in marketing and possibly a surprising variety of other areas. We further propose an apriori based algorithm to extract hidden and interesting ALPs from a \"one-sum\" weighted transaction database. The experimental results show the effectiveness of the proposed algorithm.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122243913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
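To make the "one-sum" idea in the abstract above concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes a transaction is a dictionary of strictly positive item weights; the helper names `to_one_sum` and `mean_allocation`, and the choice to average shares over covering transactions, are hypothetical illustrations that do not come from the paper.

```python
def to_one_sum(transaction):
    """Rescale item weights so that they sum to 1 (a "one-sum" transaction).

    Assumption: `transaction` maps item -> strictly positive weight.
    """
    total = sum(transaction.values())
    if total <= 0:
        raise ValueError("transaction must have a positive total weight")
    return {item: weight / total for item, weight in transaction.items()}


def mean_allocation(database, itemset):
    """Return (support, shares) for `itemset` over a list of transactions.

    support: fraction of transactions containing every item of the itemset.
    shares:  per-item weight, renormalised within the itemset and averaged
             over the covering transactions, so the shares sum to 1.
    This averaging rule is an illustrative assumption, not the paper's.
    """
    items = set(itemset)
    covering = [t for t in database if items.issubset(t)]
    if not covering:
        return 0.0, {}
    support = len(covering) / len(database)
    shares = {item: 0.0 for item in items}
    for t in covering:
        sub_total = sum(t[item] for item in items)
        for item in items:
            shares[item] += t[item] / sub_total / len(covering)
    return support, shares


if __name__ == "__main__":
    db = [
        to_one_sum({"bread": 1, "milk": 1}),
        to_one_sum({"bread": 3, "milk": 1}),
        to_one_sum({"tea": 5}),
    ]
    print(mean_allocation(db, {"bread", "milk"}))
```

In this toy database two of the three transactions contain both items, and the averaged shares describe how weight is allocated between them, which is the kind of "allocating" relationship the abstract refers to.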
Keigo Yoshida, M. Inui, T. Yairi, K. Machida, Masaki Shioya, Y. Masukawa
This paper addresses the problem of identifying the causal variables behind a system anomaly. In complicated real-world systems, even experts often fail to specify causal factors, so they attempt to detect the anomaly with exploratory heuristics. Our goal is to offer further information that supports anomaly cause analysis using this incomplete empirical knowledge. The proposed technique discovers the factors responsible for a fault by leveraging domain knowledge through a combination of semi-supervised linear discriminant analysis (LDA) and a boundary-based discriminative subspace identification method. Experimental results on synthetic and real datasets confirmed the validity of our approach. Moreover, we applied this method to building energy fault diagnosis and succeeded in extracting causal variables for energy waste in a building.
{"title":"Identification of Causal Variables for Building Energy Fault Detection by Semi-supervised LDA and Decision Boundary Analysis","authors":"Keigo Yoshida, M. Inui, T. Yairi, K. Machida, Masaki Shioya, Y. Masukawa","doi":"10.1109/ICDMW.2008.44","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.44","url":null,"abstract":"This paper addresses the identification problem of causal variables for the system anomaly. In real-world complicated systems, even experts often fail to specify causal factors, thus they attempt to detect the anomaly with exploratory heuristics. Our goal is to offer further information that supports anomaly cause analysis using the incomplete empirical knowledge. Proposed technique discovers responsible factors for the fault by leveraging domain knowledge with an effective combination of semi-supervised linear discriminant analysis (LDA) and boundary-based discriminative subspace identification method. Experimental results on synthetic and real dataset confirmed validity of our approach. Moreover, we applied this method to the building energy fault diagnosis and succeeded in extracting causal variables for energy waste in a building.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115377070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sasi K. Pitchaimalai, C. Ordonez, Carlos Garcia-Alvarado
Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and user-defined functions (UDFs). We concentrate on efficient Euclidean distance computation for the well-known K-means clustering algorithm. We present SQL query optimizations and a scalar UDF to compute Euclidean distance. We experimentally evaluate performance and scalability of our proposed SQL queries and UDF with large data sets on a modern DBMS. We benchmark distance computation on two important data mining techniques: clustering and classification. In general, UDFs are faster than SQL queries because they are executed in main memory. Data set size is the main factor impacting performance, followed by data set dimensionality.
{"title":"Efficient Distance Computation Using SQL Queries and UDFs","authors":"Sasi K. Pitchaimalai, C. Ordonez, Carlos Garcia-Alvarado","doi":"10.1109/ICDMW.2008.135","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.135","url":null,"abstract":"Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and user-defined functions (UDFs). We concentrate on efficient Euclidean distance computation for the well-known K-means clustering algorithm. We present SQL query optimizations and a scalar UDF to compute Euclidean distance. We experimentally evaluate performance and scalability of our proposed SQL queries and UDF with large data sets on a modern DBMS. We benchmark distance computation on two important data mining techniques: clustering and classification. In general, UDFs are faster than SQL queries because they are executed in main memory. Data set size is the main factor impacting performance, followed by data set dimensionality.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"514 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116207931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
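The idea of computing Euclidean distance with a SQL query, as described in the abstract above, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' queries or UDF: it uses Python's built-in `sqlite3` module, a "vertical" layout with one row per (point, dimension) pair, and hypothetical table names `X` and `C`. Because SQLite does not guarantee a `sqrt` function, the query returns squared distances and the square root is taken in Python.

```python
import math
import sqlite3


def pairwise_distances(points, centroids):
    """Return {(point_id, centroid_id): Euclidean distance}, computed via SQL."""
    con = sqlite3.connect(":memory:")
    # Vertical layout (an assumption): i/j = row id, l = dimension, val = value.
    con.execute("CREATE TABLE X (i INTEGER, l INTEGER, val REAL)")
    con.execute("CREATE TABLE C (j INTEGER, l INTEGER, val REAL)")
    con.executemany(
        "INSERT INTO X VALUES (?, ?, ?)",
        [(i, l, v) for i, p in enumerate(points) for l, v in enumerate(p)],
    )
    con.executemany(
        "INSERT INTO C VALUES (?, ?, ?)",
        [(j, l, v) for j, c in enumerate(centroids) for l, v in enumerate(c)],
    )
    # Join on the dimension index and sum squared differences per pair.
    rows = con.execute(
        "SELECT X.i, C.j, SUM((X.val - C.val) * (X.val - C.val)) "
        "FROM X JOIN C ON X.l = C.l "
        "GROUP BY X.i, C.j"
    ).fetchall()
    con.close()
    return {(i, j): math.sqrt(d2) for i, j, d2 in rows}


def nearest_centroid(distances):
    """K-means assignment step: map each point id to its closest centroid id."""
    best = {}
    for (i, j), d in distances.items():
        if i not in best or d < best[i][1]:
            best[i] = (j, d)
    return {i: j for i, (j, _) in best.items()}


if __name__ == "__main__":
    dist = pairwise_distances([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (3.0, 0.0)])
    print(dist)
    print(nearest_centroid(dist))
```

The join produces one row per point, centroid, and dimension, so the work grows with the number of points, the number of centroids, and the dimensionality.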
We present an interactive system to query, explore and navigate data according to a hierarchical knowledge model that has been automatically populated from unstructured textual data. Our system differs from systems that assist in the navigation of domain ontologies and in mining between pairs of concepts in that it enables access to unstructured data through abstract concepts and the relations between them. Concepts in turn are specified by sets of models and their relations. However, some concepts may not have a direct representation in the text. In particular, the demonstration query by model/cancer (QbM/C) is based on unstructured pathology reports. The knowledge model represents both named entities such as diagnosis and anatomical site, and higher-level concepts such as primary and metastatic tumor. Such concepts are based on the relations between named entities. We will present the data layout and access mechanism from the GUI to the data.
{"title":"Interactive Exploration of Model-Based Automatically Extracted Data","authors":"A. Coden, I. Sominsky, M. Tanenblatt","doi":"10.1109/ICDMW.2008.34","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.34","url":null,"abstract":"We present an interactive system to query, explore and navigate data according to a hierarchical knowledge model that had been automatically populated from unstructured textual data. Our system differs from systems assisting in the navigation of domain ontologies and mining between pairs of concepts in that it enables access to unstructured data by abstract concepts and relations between them. Concepts in turn are specified by sets of models and their relations. However, some concepts may not have a direct representation in the text. In particular, the demonstration query by model/cancer (QbM/C) is based on unstructured pathology reports. The knowledge model represents both named entities such as diagnosis and anatomical site, and higher level concepts such as primary and metastatic tumor. Such concepts are based on the relations between named entities. We will present the data layout and access mechanism from the GUI to the data.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116432400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A variety of services have recently been provided that depend on highly developed networks and personal equipment. With these advances, connecting this equipment has become increasingly complicated. Problems such as connection failures have increased, and determining their cause has become difficult in some cases because software is often updated to keep up with advancements in services or security. Telecom operators must understand the situation and act as quickly as possible when they receive customer enquiries. In this paper, we propose a method for analyzing and classifying customer enquiries that enables quick and efficient responses. Because customer enquiries are generally stored as unstructured textual data, the method uses dependency parsing and a co-occurrence technique to classify a large amount of unstructured data into patterns.
{"title":"Semantic Analysis Method for Unstructured Data in Telecom Services","authors":"M. Iwashita, K. Nishimatsu, S. Shimogawa","doi":"10.1109/ICDMW.2008.79","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.79","url":null,"abstract":"A variety of services have recently been provided depending on highly developed networks and personal equipment. With these advances, connecting this equipment has become increasingly more complicated. Problems such as an increase in no-connection and determining the cause have become difficult in some cases because software is often updated to keep up with advancements in services or security. Telecom operators must understand the situation and act as quickly as possible when they receive customer enquiries. In this paper, we propose a method for analyzing and classifying customer enquiries that enables quick and efficient responses. This method is based upon a dependency parsing and co-occurrence technique to enable classification of a large amount of unstructured data into patterns because customer enquiries are generally stored as unstructured textual data.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125406914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article introduces ARUBAS, a new framework for building associative classifiers. In contrast with many existing associative classifiers, it uses class association rules to transform the feature space and uses instance-based reasoning to classify new instances. The framework allows the researcher to use any association rule mining algorithm to produce the class association rules. Every aspect of the framework is introduced and discussed in detail, and five different fitness measures used for classification purposes are defined. The empirical results determine which fitness measure is the best and compare the framework with other classifiers. These results show that the ARUBAS framework is able to produce associative classifiers which are competitive with other classification techniques. More specifically, with ARUBAS-Scheffer-phi5 we have introduced a parameter-free algorithm which is competitive with classification techniques such as C4.5, RIPPER and CBA.
{"title":"ARUBAS: An Association Rule Based Similarity Framework for Associative Classifiers","authors":"B. Depaire, K. Vanhoof, G. Wets","doi":"10.1109/ICDMW.2008.58","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.58","url":null,"abstract":"This article introduces ARUBAS, a new framework to build associative classifiers. In contrast with many existing associative classifiers, it uses class association rules to transform the feature space and uses instance-based reasoning to classify new instances. The framework allows the researcher to use any association rule mining algorithm to produce the class association rules. Every aspect of the framework is extensively introduced and discussed and five different fitness measures used for classification purposes are defined. The empirical results determine which fitness measure is the best and compares the framework with other classifiers. These results show that the ARUBAS framework is able to produce associative classifiers which are competitive with other classification techniques. More specifically, with ARUBAS-Scheffer-phi5 we have introduced a parameter-free algorithm which is competitive with classification techniques such as C4.5, RIPPER and CBA.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126165619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}