Entity Resolution with Markov Logic. Parag Singla, Pedro M. Domingos. DOI: 10.1109/ICDM.2006.65
Entity resolution is the problem of determining which records in a database refer to the same entities, and is a crucial and expensive step in the data mining process. Interest in it has grown rapidly, and many approaches have been proposed. However, they tend to address only isolated aspects of the problem, and are often ad hoc. This paper proposes a well-founded, integrated solution to the entity resolution problem based on Markov logic. Markov logic combines first-order logic and probabilistic graphical models by attaching weights to first-order formulas, and viewing them as templates for features of Markov networks. We show how a number of previous approaches can be formulated and seamlessly combined in Markov logic, and how the resulting learning and inference problems can be solved efficiently. Experiments on two citation databases show the utility of this approach, and evaluate the contribution of the different components.
Active Learning to Maximize Area Under the ROC Curve. Matt Culver, Kun Deng, S. Scott. DOI: 10.1109/ICDM.2006.12
In active learning, a machine learning algorithm is given an unlabeled set of examples U, and is allowed to request labels for a relatively small subset of U to use for training. The goal is then to judiciously choose which examples in U to have labeled in order to optimize some performance criterion, e.g. classification accuracy. We study how active learning affects AUC. We examine two existing algorithms from the literature and present our own active learning algorithms designed to maximize the AUC of the hypothesis. One of our algorithms was consistently the top performer, and Closest Sampling from the literature often came in second behind it. When good posterior probability estimates were available, our heuristics were by far the best.
CoMiner: An Effective Algorithm for Mining Competitors from the Web. Rui-gang Li, Shenghua Bao, Jin Wang, Yong Yu, Yunbo Cao. DOI: 10.1109/ICDM.2006.38
This paper addresses the novel task of mining competitive information about an entity (such as a company, product, or person) from the web. An algorithm called "CoMiner" is proposed, which first extracts a set of comparative candidates for the input entity, then ranks them according to their comparability, and finally extracts the competitive fields. Experimental results show that the proposed algorithm effectively draws a complete picture of the competitive relations of a given entity.
Frequent Closed Itemset Mining Using Prefix Graphs with an Efficient Flow-Based Pruning Strategy. H. Moonesinghe, S. Fodeh, P. Tan. DOI: 10.1109/ICDM.2006.74
This paper presents PGMiner, a novel graph-based algorithm for mining frequent closed itemsets. Our approach constructs a prefix graph structure and decomposes the database into variable-length bit vectors, which are assigned to the nodes of the graph. The main advantage of this representation is that the bit vectors at each node are relatively shorter than those produced by existing vertical mining methods, which facilitates fast frequency counting of itemsets via intersection operations. We also devise several inter-node and intra-node pruning strategies to substantially reduce the combinatorial search space. Unlike other existing approaches, we do not need to keep in memory the entire set of closed itemsets mined so far in order to check whether a candidate itemset is closed. This dramatically reduces the memory usage of our algorithm, especially at low support thresholds. Our experiments on synthetic and real-world data sets show that PGMiner outperforms existing mining algorithms by as much as an order of magnitude and is scalable to very large databases.
delta-Tolerance Closed Frequent Itemsets. James Cheng, Yiping Ke, Wilfred Ng. DOI: 10.1109/ICDM.2006.1
In this paper, we study an inherent problem of mining frequent itemsets (FIs): the number of FIs mined is often too large. The large number of FIs not only affects the mining performance, but also severely limits the applicability of FI mining. In the literature, closed FIs (CFIs) and maximal FIs (MFIs) have been proposed as concise representations of FIs. However, the number of CFIs is still too large in many cases, while MFIs lose information about the frequency of the FIs. To address this problem, we relax the restrictive definition of CFIs and propose delta-tolerance CFIs (delta-TCFIs). Mining delta-TCFIs recursively removes all subsets of a delta-TCFI that fall within a frequency distance bounded by delta. We propose two algorithms, CFI2TCFI and MineTCFI, to mine delta-TCFIs. CFI2TCFI achieves very high accuracy on the estimated frequency of the recovered FIs but is less efficient when the number of CFIs is large, since it is based on CFI mining. MineTCFI is significantly faster and consumes less memory than the state-of-the-art algorithms for mining concise representations of FIs, while its accuracy is only slightly lower than that of CFI2TCFI.
What is the Dimension of Your Binary Data? Nikolaj Tatti, Taneli Mielikäinen, A. Gionis, H. Mannila. DOI: 10.1109/ICDM.2006.167
Many 0/1 datasets have a very large number of variables; however, they are sparse and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datasets, and show that the basic idea of fractal dimension can be adapted for binary data. However, as such the fractal dimension is difficult to interpret. Hence we introduce the concept of normalized fractal dimension. For a dataset D, its normalized fractal dimension counts the number of independent columns needed to achieve the unnormalized fractal dimension of D. The normalized fractal dimension measures the degree of dependency structure of the data. We study the properties of the normalized fractal dimension and discuss its computation. We give empirical results on the normalized fractal dimension, comparing it against PCA.
Bayesian State Space Modeling Approach for Measuring the Effectiveness of Marketing Activities and Baseline Sales from POS Data. T. Ando. DOI: 10.1109/ICDM.2006.25
Analysis of point-of-sale (POS) data is an important research area in marketing science and knowledge discovery, as it may enable marketing managers to carry out more effective marketing activities. To measure the effectiveness of marketing activities and baseline sales, we develop a multivariate time series modeling method within the framework of a general state space model. A multivariate Poisson model and a multivariate correlated autoregressive model are used as the system and observation models. A Bayesian approach via a Markov chain Monte Carlo (MCMC) algorithm is employed to estimate the model parameters. To evaluate the goodness of fit of the estimated models, the Bayesian predictive information criterion is used. The proposed model is evaluated through an application to actual POS data.
Decision Trees for Functional Variables. Suhrid Balakrishnan, D. Madigan. DOI: 10.1109/ICDM.2006.49
Classification problems with functionally structured input variables arise naturally in many applications. In a clinical domain, for example, input variables could include a time series of blood pressure measurements. In a financial setting, different time series of stock returns might serve as predictors. In an archaeological application, the 2D profile of an artifact may serve as a key input variable. In such domains, accuracy of the classifier is not the only reasonable goal to strive for; classifiers that provide easily interpretable results are also of value. In this work, we present an intuitive scheme for extending decision trees to handle functional input variables. Our results show that such decision trees are both accurate and readily interpretable.
A Parameterized Probabilistic Model of Network Evolution for Supervised Link Prediction. H. Kashima, N. Abe. DOI: 10.1109/ICDM.2006.8
We introduce a new approach to the problem of link prediction for network structured domains, such as the Web, social networks, and biological networks. Our approach is based on the topological features of network structures, not on the node features. We present a novel parameterized probabilistic model of network evolution and derive an efficient incremental learning algorithm for such models, which is then used to predict links among the nodes. We show some promising experimental results using biological network data sets.
Large Scale Detection of Irregularities in Accounting Data. Stephen D. Bay, K. Kumaraswamy, M. Anderle, Rohit Kumar, D. Steier. DOI: 10.1109/ICDM.2006.93
In recent years, there have been several large accounting frauds where a company's financial results have been intentionally misrepresented by billions of dollars. In response, regulatory bodies have mandated that auditors perform analytics on detailed financial data with the intent of discovering such misstatements. For a large auditing firm, this may mean analyzing millions of records from thousands of clients. This paper proposes techniques for automatic analysis of company general ledgers on such a large scale, identifying irregularities - which may indicate fraud or just honest errors - for additional review by auditors. These techniques have been implemented in a prototype system, called Sherlock, which combines aspects of both outlier detection and classification. In developing Sherlock, we faced three major challenges: developing an efficient process for obtaining data from many heterogeneous sources, training classifiers with only positive and unlabeled examples, and presenting information to auditors in an easily interpretable manner. In this paper, we describe how we addressed these challenges over the past two years and report on experiments evaluating Sherlock.