Mining Maximal Quasi-Bicliques to Co-Cluster Stocks and Financial Ratios for Value Investment
Kelvin Sim, Jinyan Li, V. Gopalkrishnan, Guimei Liu
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.111

We introduce an unsupervised process to co-cluster groups of stocks and financial ratios, so that investors can gain more insight into how they are correlated. Our co-clustering is based on a graph concept called maximal quasi-bicliques, which can tolerate the erroneous and/or missing information that is common in stock and financial ratio data. In contrast to previous work, our maximal quasi-bicliques require the errors to be evenly distributed, which enables us to capture more meaningful co-clusters. We develop a new algorithm that efficiently enumerates maximal quasi-bicliques from an undirected graph. The concept of maximal quasi-bicliques is domain-independent; it can be extended to co-cluster any data that can be modeled as a graph.
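The error-tolerance condition can be made concrete with a small membership check: a pair of vertex sets (A, B) forms a quasi-biclique with evenly distributed errors when every vertex misses at most a bounded number of edges to the opposite side. The sketch below is illustrative only; the adjacency representation and the per-vertex error bound `eps` are assumptions, not the paper's exact definition.

```python
def is_quasi_biclique(adj, A, B, eps):
    """Check whether (A, B) is a quasi-biclique in which every vertex
    misses at most eps edges to the opposite side (errors evenly spread).
    adj maps each vertex to its set of neighbours."""
    for a in A:
        if len(B - adj[a]) > eps:   # edges from a to B that are missing
            return False
    for b in B:
        if len(A - adj[b]) > eps:   # edges from b to A that are missing
            return False
    return True
```

An enumeration algorithm such as the paper's would use a check like this as its pruning condition while growing candidate vertex sets.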
Corrective Classification: Classifier Ensembling with Corrective and Diverse Base Learners
Yan Zhang, Xingquan Zhu, Xindong Wu
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.45

Empirical studies on supervised learning have shown that ensembling methods produce a model superior to one built from a single learner in many circumstances, especially when learning from imperfect information sources such as biased or noise-infected data. In this paper, we present a novel corrective classification (C2) design, which incorporates error detection, data cleansing, and bootstrap sampling to construct the base learners that constitute the classifier ensemble. The essential goal is to reduce the impact of noise and ultimately enhance the learners built from noise-corrupted data. We further analyze the importance of both the accuracy and the diversity of base learners in ensembling, to shed light on the mechanism by which C2 works. Experimental comparisons demonstrate that C2 is not only superior to the learner built from the original noisy sources, but also more reliable than bagging or the aggressive classifier ensemble (ACE), which are two degenerate components/variants of C2.
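The pipeline described above (error detection, cleansing, bootstrap sampling, then voting) can be sketched as a generic skeleton. The `train_base` and `detect_noise` callables are hypothetical placeholders standing in for the paper's actual components, not its published procedure.

```python
import random
from collections import Counter

def c2_ensemble(data, train_base, detect_noise, n_learners=5, seed=0):
    """Corrective-classification-style ensemble sketch: each base learner
    is trained on a bootstrap sample from which suspected noisy records
    have been filtered out; prediction is by majority vote."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]           # bootstrap
        cleaned = [rec for rec in sample
                   if not detect_noise(rec, sample)]        # cleansing
        learners.append(train_base(cleaned))
    def predict(x):                                         # majority vote
        votes = Counter(clf(x) for clf in learners)
        return votes.most_common(1)[0][0]
    return predict
```

Bagging falls out as the special case where `detect_noise` never flags anything, which matches the paper's description of bagging as a degenerate variant of C2.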
Bregman Bubble Clustering: A Robust, Scalable Framework for Locating Multiple, Dense Regions in Data
Gunjan Gupta, Joydeep Ghosh
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.32

In traditional clustering, every data point is assigned to at least one cluster. At the other extreme, recently proposed one-class clustering algorithms identify a single dense cluster and consider the rest of the data irrelevant. However, in many problems the relevant data forms multiple natural clusters. In this paper, we introduce the notion of Bregman bubbles and propose Bregman bubble clustering (BBC), which seeks k dense Bregman bubbles in the data. We also present a corresponding generative model, soft BBC, and show several connections with Bregman clustering and with a one-class clustering algorithm. Empirical results on various datasets show the effectiveness of our method.
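For intuition, with squared Euclidean distance as the Bregman divergence, seeking k dense bubbles can be pictured as a k-means-like loop that retains only the densest fraction of points. This is an illustrative simplification under that assumption, not the paper's exact BBC or soft BBC algorithm.

```python
import numpy as np

def bregman_bubble(X, k, keep_frac=0.5, n_iter=20, seed=0):
    """Toy sketch: alternately assign points to their nearest centroid,
    retain only the keep_frac fraction with the smallest divergence
    (the dense bubbles), and update centroids from retained points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # n x k
        assign = d.argmin(1)
        best = d.min(1)
        cutoff = np.quantile(best, keep_frac)
        mask = best <= cutoff                    # points inside bubbles
        for j in range(k):
            members = X[mask & (assign == j)]
            if len(members):
                centers[j] = members.mean(0)
    return centers, assign, mask
```

Points outside the bubbles (`mask == False`) are treated as irrelevant "don't care" data, which is what distinguishes this setting from exhaustive clustering.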
Recommendation on Item Graphs
Fei Wang, Shengchao Ma, Liuzhong Yang, Ta-Hsin Li
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.133

A novel scheme for item-based recommendation is proposed in this paper. In our framework, the items are described by an undirected weighted graph Q = (V, ε), where the node set V is identical to the item set and ε is the edge set. Associated with each edge e_ij ∈ ε is a weight ω_ij ≥ 0, which represents the similarity between items i and j. Without loss of generality, we assume that any user's ratings of the items should be sufficiently smooth with respect to the intrinsic structure of the items, i.e., a user should give similar ratings to similar items. A simple algorithm is presented to compute such a smooth solution. Encouraging experimental results show the effectiveness of our method.
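One standard way to obtain such a smooth rating vector (a sketch of the smoothness idea, not necessarily the paper's exact algorithm) is graph regularization: minimize ||f - y||² + λ·fᵀLf over the item graph's Laplacian L, which has the closed form f = (I + λL)⁻¹y.

```python
import numpy as np

def smooth_ratings(W, y, lam=1.0):
    """Propagate a user's known ratings over the item graph by minimising
    ||f - y||^2 + lam * f^T L f, i.e. f = (I + lam*L)^{-1} y.
    W: symmetric item-similarity matrix; y: observed ratings (0 = unrated)."""
    L = np.diag(W.sum(1)) - W                  # combinatorial Laplacian
    n = len(W)
    return np.linalg.solve(np.eye(n) + lam * L, y)
```

On a chain of three items where only the first is rated, the solved scores decay monotonically along the chain, so unrated items similar to liked ones are ranked highest.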
A Framework for Regional Association Rule Mining in Spatial Datasets
W. Ding, C. Eick, Jing Wang, Xiaojing Yuan
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.5

The immense explosion of geographically referenced data calls for efficient discovery of spatial knowledge. One of the special challenges for spatial data mining is that information is usually not uniformly distributed in spatial datasets. Consequently, the discovery of regional knowledge is of fundamental importance for spatial data mining. This paper centers on discovering regional association rules in spatial datasets. In particular, we introduce a novel framework to mine regional association rules relying on a given class structure. A reward-based regional discovery methodology is introduced, and a divisive, grid-based supervised clustering algorithm is presented that identifies interesting subregions in spatial datasets. Then, an integrated approach is discussed to systematically mine regional rules. The proposed framework is evaluated in a real-world case study that identifies spatial risk patterns from arsenic in the Texas water supply.
Deploying Approaches for Pattern Refinement in Text Mining
Sheng-Tang Wu, Yuefeng Li, Yue Xu
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.50

Text mining is a technique that helps users find useful information in large collections of digital text documents on the Web or in databases. Instead of the keyword-based approach typically used in this field, a pattern-based model containing frequent sequential patterns is employed to perform the same tasks. However, how to effectively use these discovered patterns remains a significant challenge. In this study, we propose two approaches based on pattern deploying strategies. The performance of the pattern deploying algorithms for text mining is evaluated on the Reuters dataset RCV1, and the results show that effectiveness is improved by our proposed pattern refinement approaches.
Adaptive Blocking: Learning to Scale Up Record Linkage
M. Bilenko, B. Kamath, R. Mooney
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.13

Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
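The predicate-based blocking idea can be illustrated with a small sketch: each predicate maps a record to a key, records sharing a key fall into a common block, and only within-block pairs become candidates. Which disjunction of predicates to use is what the paper's algorithms learn; the predicates in the example below are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, predicates):
    """Generate candidate pairs via blocking: for each predicate, build an
    inverted index from key to record ids, then emit only within-block
    pairs. All other pairs are implicitly treated as dissimilar."""
    candidates = set()
    for pred in predicates:
        blocks = defaultdict(list)
        for i, rec in enumerate(records):
            blocks[pred(rec)].append(i)
        for ids in blocks.values():
            candidates.update(combinations(ids, 2))
    return candidates
```

With n records and small blocks, the candidate set is far smaller than the n·(n-1)/2 exhaustive pairs, which is the quadratic blow-up blocking exists to avoid.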
An Efficient Reference-Based Approach to Outlier Detection in Large Datasets
Yaling Pei, Osmar R. Zaiane, Yong Gao
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.17

A bottleneck in detecting distance- and density-based outliers is that a nearest-neighbor search is required for each data point, resulting in a quadratic number of pairwise distance evaluations. In this paper, we propose a new method that uses the relative degree of density with respect to a fixed set of reference points to approximate the degree of density defined in terms of a data point's nearest neighbors. The running time of our algorithm based on this approximation is O(Rn log n), where n is the size of the dataset and R is the number of reference points. Candidate outliers are ranked based on the outlier score assigned to each data point. Theoretical analysis and empirical studies show that our method is effective, efficient, and highly scalable to very large datasets.
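A simplified sketch of the reference-based idea: project the data onto its distances to each reference point (one O(n log n) sort per reference) and score each point by how sparse its neighborhood is in that one-dimensional ordering. The k-th-nearest-gap scoring rule below is an illustrative stand-in, not the paper's exact relative-density definition.

```python
import numpy as np

def reference_outlier_scores(X, refs, k=2):
    """Score each point by the gap to its k-th nearest neighbour in the
    1-D ordering induced by distance to each reference point; sparse
    points get large gaps. Scores are combined by taking the maximum
    over references (a heuristic choice for this sketch)."""
    n = len(X)
    scores = np.zeros(n)
    for r in refs:
        d = np.linalg.norm(X - r, axis=1)      # distance to this reference
        order = np.argsort(d)                  # O(n log n) per reference
        sd = d[order]
        gaps = np.empty(n)
        for i in range(n):
            lo, hi = max(i - k, 0), min(i + k, n - 1)
            g = sorted(abs(sd[j] - sd[i])
                       for j in range(lo, hi + 1) if j != i)
            gaps[i] = g[min(k, len(g)) - 1]    # k-th nearest 1-D gap
        s = np.empty(n)
        s[order] = gaps                        # map back to original indices
        scores = np.maximum(scores, s)
    return scores
```

Because each point's score depends only on its position in R sorted one-dimensional projections, no pairwise nearest-neighbor search over the full dataset is needed.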
Latent Friend Mining from Blog Data
Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.95

The rapid growth of blog (also known as "weblog") data provides a rich resource for social community mining. In this paper, we put forward a novel research problem: mining the latent friends of bloggers based on the contents of their blog entries. Latent friends are defined here as people who share similar topic distributions in their blogs. These people may not actually know each other, but they have the interest and the potential to find each other. Three approaches are designed for latent friend detection. The first, the cosine similarity-based method, determines the similarity between bloggers by computing the cosine similarity between the contents of their blogs. The second, the topic-based method, discovers latent topics using a latent topic model and then computes similarity at the topic level. The third is a two-level similarity-based method conducted in two stages: in the first stage, an existing topic hierarchy is exploited to build a topic distribution for each blogger; in the second stage, a detailed similarity comparison is conducted for the bloggers found to be close in interest in the first stage. Our experimental results show that both the topic-based and two-level similarity-based methods work well, and that the last approach performs much better than the first two. We also give a detailed analysis of the advantages and disadvantages of the different approaches.
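The cosine similarity at the core of the first two methods reduces to ranking bloggers by the cosine of their content or topic vectors. A minimal sketch, where the toy topic distributions in the usage example are assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic/term weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def latent_friends(topic_dists, query, top_n=3):
    """Rank other bloggers by cosine similarity of their topic
    distributions to the query blogger's distribution."""
    sims = [(cosine(topic_dists[b], topic_dists[query]), b)
            for b in topic_dists if b != query]
    return [b for _, b in sorted(sims, reverse=True)[:top_n]]
```

The two-level method would apply a comparison like this only to bloggers whose coarse topic-hierarchy distributions already place them close to the query blogger.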
High-Performance Unsupervised Relation Extraction from Large Corpora
Binyamin Rosenfeld, Ronen Feldman
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.82

We present URIES, an unsupervised relation identification and extraction system. The system automatically identifies interesting binary relations between entities in the input corpus, and then proceeds to extract a large number of instances of these relations. The system discovers relations by clustering frequently co-occurring pairs of entities based on the contexts in which they appear. Its complex pattern-based representation of contexts allows the clustering step to achieve very high precision, sufficient for the clusters to serve as sets of seeds for bootstrapping a high-recall relation extraction process. In a series of experiments, we demonstrate the successful performance of URIES and compare it to two existing systems: a weakly supervised high-recall Web relation extraction system called SRES, and an unsupervised relation identification system that uses a simpler bag-of-words representation of contexts. The experiments show that URIES performs comparably to SRES, but without any supervision, and that this performance is due to the power of its complex context representation and its novel candidate selection method.