Entity Resolution with Markov Logic. Parag Singla, Pedro M. Domingos. DOI: 10.1109/ICDM.2006.65
Entity resolution is the problem of determining which records in a database refer to the same entities, and is a crucial and expensive step in the data mining process. Interest in it has grown rapidly, and many approaches have been proposed. However, they tend to address only isolated aspects of the problem, and are often ad hoc. This paper proposes a well-founded, integrated solution to the entity resolution problem based on Markov logic. Markov logic combines first-order logic and probabilistic graphical models by attaching weights to first-order formulas, and viewing them as templates for features of Markov networks. We show how a number of previous approaches can be formulated and seamlessly combined in Markov logic, and how the resulting learning and inference problems can be solved efficiently. Experiments on two citation databases show the utility of this approach, and evaluate the contribution of the different components.
Active Learning to Maximize Area Under the ROC Curve. Matt Culver, Kun Deng, S. Scott. DOI: 10.1109/ICDM.2006.12
In active learning, a machine learning algorithm is given an unlabeled set of examples U, and is allowed to request labels for a relatively small subset of U to use for training. The goal is then to judiciously choose which examples in U to have labeled in order to optimize some performance criterion, e.g. classification accuracy. We study how active learning affects AUC. We examine two existing algorithms from the literature and present our own active learning algorithms designed to maximize the AUC of the hypothesis. One of our algorithms was consistently the top performer, and Closest Sampling from the literature often came in second behind it. When good posterior probability estimates were available, our heuristics were by far the best.
CoMiner: An Effective Algorithm for Mining Competitors from the Web. Rui-gang Li, Shenghua Bao, Jin Wang, Yong Yu, Yunbo Cao. DOI: 10.1109/ICDM.2006.38
This paper addresses the novel task of mining competitive information about an entity (such as a company, product, or person) from the web. An algorithm called "CoMiner" is proposed, which first extracts a set of comparative candidates for the input entity, then ranks them according to their comparability, and finally extracts the competitive fields. Experimental results show that the proposed algorithm effectively draws a complete picture of the competitive relations of a given entity.
Frequent Closed Itemset Mining Using Prefix Graphs with an Efficient Flow-Based Pruning Strategy. H. Moonesinghe, S. Fodeh, P. Tan. DOI: 10.1109/ICDM.2006.74
This paper presents PGMiner, a novel graph-based algorithm for mining frequent closed itemsets. Our approach constructs a prefix graph structure and decomposes the database into variable-length bit vectors, which are assigned to the nodes of the graph. The main advantage of this representation is that the bit vectors at each node are relatively shorter than those produced by existing vertical mining methods, which facilitates fast frequency counting of itemsets via intersection operations. We also devise several inter-node and intra-node pruning strategies to substantially reduce the combinatorial search space. Unlike other existing approaches, we do not need to keep in memory the entire set of closed itemsets mined so far in order to check whether a candidate itemset is closed. This dramatically reduces the memory usage of our algorithm, especially at low support thresholds. Our experiments on synthetic and real-world data sets show that PGMiner outperforms existing mining algorithms by as much as an order of magnitude and is scalable to very large databases.
delta-Tolerance Closed Frequent Itemsets. James Cheng, Yiping Ke, Wilfred Ng. DOI: 10.1109/ICDM.2006.1
In this paper, we study an inherent problem of mining frequent itemsets (FIs): the number of FIs mined is often too large. The large number of FIs not only affects the mining performance, but also severely limits the applicability of FI mining. In the literature, closed FIs (CFIs) and maximal FIs (MFIs) have been proposed as concise representations of FIs. However, the number of CFIs is still too large in many cases, while MFIs lose information about the frequency of the FIs. To address this problem, we relax the restrictive definition of CFIs and propose delta-tolerance CFIs (delta-TCFIs). Mining delta-TCFIs recursively removes all subsets of a delta-TCFI that fall within a frequency distance bounded by delta. We propose two algorithms, CFI2TCFI and MineTCFI, to mine delta-TCFIs. CFI2TCFI achieves very high accuracy on the estimated frequency of the recovered FIs but is less efficient when the number of CFIs is large, since it is based on CFI mining. MineTCFI is significantly faster and consumes less memory than the state-of-the-art algorithms for mining concise representations of FIs, while its accuracy is only slightly lower than that of CFI2TCFI.
What is the Dimension of Your Binary Data? Nikolaj Tatti, Taneli Mielikäinen, A. Gionis, H. Mannila. DOI: 10.1109/ICDM.2006.167
Many 0/1 datasets have a very large number of variables; however, they are sparse and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datasets, and show that the basic idea of fractal dimension can be adapted for binary data. However, as such the fractal dimension is difficult to interpret. Hence we introduce the concept of normalized fractal dimension. For a dataset D, its normalized fractal dimension counts the number of independent columns needed to achieve the unnormalized fractal dimension of D. The normalized fractal dimension measures the degree of dependency structure of the data. We study the properties of the normalized fractal dimension and discuss its computation. We give empirical results on the normalized fractal dimension, comparing it against PCA.
Bayesian State Space Modeling Approach for Measuring the Effectiveness of Marketing Activities and Baseline Sales from POS Data. T. Ando. DOI: 10.1109/ICDM.2006.25
Analysis of point-of-sale (POS) data is an important research area in marketing science and knowledge discovery, as it may enable marketing managers to carry out more effective marketing activities. To measure the effectiveness of marketing activities and baseline sales, we develop a multivariate time series modeling method within the framework of a general state space model. A multivariate Poisson model and a multivariate correlated autoregressive model are used as the system and observation models. A Bayesian approach via a Markov chain Monte Carlo (MCMC) algorithm is employed to estimate the model parameters. To evaluate the goodness of fit of the estimated models, the Bayesian predictive information criterion is used. The proposed model is evaluated through an application to actual POS data.
Decision Trees for Functional Variables. Suhrid Balakrishnan, D. Madigan. DOI: 10.1109/ICDM.2006.49
Classification problems with functionally structured input variables arise naturally in many applications. In a clinical domain, for example, input variables could include a time series of blood pressure measurements. In a financial setting, different time series of stock returns might serve as predictors. In an archaeological application, the 2D profile of an artifact may serve as a key input variable. In such domains, accuracy of the classifier is not the only reasonable goal to strive for; classifiers that provide easily interpretable results are also of value. In this work, we present an intuitive scheme for extending decision trees to handle functional input variables. Our results show that such decision trees are both accurate and readily interpretable.
A Parameterized Probabilistic Model of Network Evolution for Supervised Link Prediction. H. Kashima, N. Abe. DOI: 10.1109/ICDM.2006.8
We introduce a new approach to the problem of link prediction for network structured domains, such as the Web, social networks, and biological networks. Our approach is based on the topological features of network structures, not on the node features. We present a novel parameterized probabilistic model of network evolution and derive an efficient incremental learning algorithm for such models, which is then used to predict links among the nodes. We show some promising experimental results using biological network data sets.
Large Scale Detection of Irregularities in Accounting Data. Stephen D. Bay, K. Kumaraswamy, M. Anderle, Rohit Kumar, D. Steier. DOI: 10.1109/ICDM.2006.93
In recent years, there have been several large accounting frauds where a company's financial results have been intentionally misrepresented by billions of dollars. In response, regulatory bodies have mandated that auditors perform analytics on detailed financial data with the intent of discovering such misstatements. For a large auditing firm, this may mean analyzing millions of records from thousands of clients. This paper proposes techniques for automatic analysis of company general ledgers on such a large scale, identifying irregularities - which may indicate fraud or just honest errors - for additional review by auditors. These techniques have been implemented in a prototype system, called Sherlock, which combines aspects of both outlier detection and classification. In developing Sherlock, we faced three major challenges: developing an efficient process for obtaining data from many heterogeneous sources, training classifiers with only positive and unlabeled examples, and presenting information to auditors in an easily interpretable manner. In this paper, we describe how we addressed these challenges over the past two years and report on experiments evaluating Sherlock.