Jilin Chen, Jun Yan, Benyu Zhang, Qiang Yang, Zheng Chen
We propose a novel algorithm for extracting diverse topic phrases in order to summarize large corpora. Previous work often ignores the importance of diversity and thus extracts phrases crowded around a few hot topics while failing to cover other less obvious but important topics. We solve this problem through document re-weighting and phrase diversification using latent semantic analysis (LSA). Experiments on various datasets show that our new algorithm improves both relevance and diversity across topics for topic phrase extraction.
{"title":"Diverse Topic Phrase Extraction through Latent Semantic Analysis","authors":"Jilin Chen, Jun Yan, Benyu Zhang, Qiang Yang, Zheng Chen","doi":"10.1109/ICDM.2006.61","DOIUrl":"https://doi.org/10.1109/ICDM.2006.61","url":null,"abstract":"We propose a novel algorithm for extracting diverse topic phrases in order to provide summary for large corpora. Previous works often ignore the importance of diversity and thus extract phrases crowded on some hot topics while failing to cover other less obvious but important topics. We solve this problem through document re-weighting and phrase diversification by using latent semantic analysis (LSA). Experiments on various datasets show that our new algorithm can improve relevance as well as diversity over different topics for topic phrase extraction problems.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116645719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
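As a rough illustration of the diversification idea this abstract describes (not the paper's exact algorithm), phrases can be projected into an LSA concept space and then selected greedily with an MMR-style trade-off between salience and similarity to already-chosen phrases. The toy matrix, phrase names, and the lambda weight below are all invented for the sketch:

```python
import numpy as np

# Toy phrase-document co-occurrence matrix: rows = candidate phrases,
# columns = documents (counts are illustrative only).
X = np.array([
    [3., 2., 0., 0.],   # "data mining"
    [2., 3., 0., 0.],   # "pattern discovery" (near-duplicate topic)
    [0., 0., 3., 2.],   # "gene expression"
    [1., 0., 0., 3.],   # "text clustering"
])
phrases = ["data mining", "pattern discovery", "gene expression", "text clustering"]

# Latent semantic analysis: project phrases into a low-rank concept space via SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = U[:, :k] * s[:k]            # phrase vectors in the 2-d latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Greedy diversified selection: trade off relevance (latent-space norm,
# a crude salience proxy) against similarity to phrases already picked.
relevance = np.linalg.norm(Z, axis=1)
lam = 0.5
selected = []
while len(selected) < 3:
    best, best_score = None, -np.inf
    for i in range(len(phrases)):
        if i in selected:
            continue
        max_sim = max((cos(Z[i], Z[j]) for j in selected), default=0.0)
        score = lam * relevance[i] - (1 - lam) * max_sim
        if score > best_score:
            best, best_score = i, score
    selected.append(best)

diverse_phrases = [phrases[i] for i in selected]
print(diverse_phrases)
```

With the duplicate-penalty term active, the two near-synonymous phrases are never both chosen early, which is the behavior the abstract contrasts with diversity-unaware extraction.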
Sequential pattern mining discovers temporal relationships between items in a database; the patterns can then be used to generate association rules. When databases are very large, the execution speed and memory usage of the mining algorithm become critical. Previous research has focused on one of these two parameters or the other. In this paper, we present bitSPADE, a novel algorithm that combines the best features of SPAM, one of the fastest algorithms, and SPADE, one of the most memory-efficient algorithms. Moreover, we introduce a new pruning strategy that enables bitSPADE to reach high performance. Experimental evaluations show that bitSPADE achieves an efficient trade-off between speed and memory usage, outperforming SPADE in both speed and memory usage by factors of more than 3.4, and SPAM in memory consumption by up to more than an order of magnitude.
{"title":"bitSPADE: A Lattice-based Sequential Pattern Mining Algorithm Using Bitmap Representation","authors":"S. Aseervatham, A. Osmani, E. Viennet","doi":"10.1109/ICDM.2006.28","DOIUrl":"https://doi.org/10.1109/ICDM.2006.28","url":null,"abstract":"Sequential pattern mining allows to discover temporal relationship between items within a database. The patterns can then be used to generate association rules. When the databases are very large, the execution speed and the memory usage of the mining algorithm become critical parameters. Previous research has focused on either one of the two parameters. In this paper, we present bitSPADE, a novel algorithm that combines the best features of SPAM, one of the fastest algorithm, and SPADE, one of the most memory efficient algorithm. Moreover, we introduce a new pruning strategy that enables bitSPADE to reach high performances. Experimental evaluations showed that bitSPADE ensures an efficient tradeoff between speed and memory usage by outperforming SPADE by both speed and memory usage factors more than 3.4 and SPAM by a memory consumption factor up to more than an order of magnitude.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121847359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
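The bitmap representation inherited from SPAM can be sketched in a few lines: each item gets one bitmap per customer sequence, and counting the support of a two-step pattern is a bit-transform plus an AND. The sequences below are invented, and this shows only the representation, not bitSPADE's lattice traversal or pruning:

```python
# Each customer sequence is a list of itemsets; for every item we build
# one bitmap per sequence, bit i set when the item occurs in itemset i.
sequences = [
    [{"a"}, {"b"}, {"a", "c"}],
    [{"b"}, {"a"}],
    [{"a"}, {"c"}, {"b"}],
]

def bitmap(item):
    return [sum(1 << i for i, itemset in enumerate(seq) if item in itemset)
            for seq in sequences]

def s_step(bm):
    # Sequence-extension transform: per sequence, set every bit strictly
    # after the first occurrence (b & -b isolates the lowest set bit).
    out = []
    for b in bm:
        if b == 0:
            out.append(0)
        else:
            first = b & -b
            out.append(~(2 * first - 1) & ((1 << 32) - 1))
    return out

def support(bm):
    # A sequence supports the pattern iff its bitmap is nonzero.
    return sum(1 for b in bm if b != 0)

bm_a, bm_b = bitmap("a"), bitmap("b")
# Support of the sequential pattern <a -> b>: 'b' must occur after 'a'.
bm_ab = [x & y for x, y in zip(s_step(bm_a), bm_b)]
print(support(bm_ab))
```

Here sequences 1 and 3 contain "b" after an "a", while sequence 2 has "b" only before "a", so the pattern's support is 2.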
We describe a part-based object-recognition framework, specialized to mining complex 3D objects from detailed 3D images. Objects are modeled as a collection of parts together with a pairwise potential function. An efficient inference algorithm, based on belief propagation (BP), finds the optimal layout of parts given an input image. We introduce AggBP, a message-aggregation scheme for BP in which groups of messages are approximated by a single message. For objects consisting of N parts, we reduce CPU time and memory requirements from O(N²) to O(N). We apply AggBP to synthetic data as well as a real-world task: identifying protein fragments in three-dimensional images. These experiments show that our improvements cause minimal loss in accuracy while requiring significantly less time.
{"title":"Belief Propagation in Large, Highly Connected Graphs for 3D Part-Based Object Recognition","authors":"F. DiMaio, J. Shavlik","doi":"10.1109/ICDM.2006.26","DOIUrl":"https://doi.org/10.1109/ICDM.2006.26","url":null,"abstract":"We describe a part-based object-recognition framework, specialized to mining complex 3D objects from detailed 3D images. Objects are modeled as a collection of parts together with a pairwise potential function. An efficient inference algorithm - based on belief propagation (BP) -finds the optimal layout of parts, given some input image. We introduce AggBP, a message aggregation scheme for BP, in which groups of messages are approximated as a single message. For objects consisting of N parts, we reduce CPU time and memory requirements from O(N2) to O(N). We apply AggBP on synthetic data as well as a real-world task identifying protein fragments in three-dimensional images. These experiments show that our improvements result in minimal loss in accuracy in significantly less time.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124130184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
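One way to see how aggregating messages removes the O(N²) factor: in a fully connected part graph, the message part i sends to part j is a product over all incoming messages except j's. Computing all N leave-one-out products naively is quadratic, but prefix/suffix products make it linear. This is a generic BP bookkeeping trick consistent with the abstract's complexity claim, not AggBP itself:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 4                                  # 6 parts, 4-state messages
msgs = rng.uniform(0.5, 1.5, size=(N, K))    # incoming message from each part

# O(N^2) reference: explicit leave-one-out products.
naive = np.array([np.prod(np.delete(msgs, j, axis=0), axis=0) for j in range(N)])

# O(N) aggregation: prefix * suffix products (avoids division, so a
# zero-valued message entry cannot poison the result).
prefix = np.ones((N + 1, K))
suffix = np.ones((N + 1, K))
for i in range(N):
    prefix[i + 1] = prefix[i] * msgs[i]
    suffix[N - 1 - i] = suffix[N - i] * msgs[N - 1 - i]
agg = np.array([prefix[j] * suffix[j + 1] for j in range(N)])

print(np.allclose(naive, agg))
```

The aggregated version touches each message a constant number of times, which is the source of the O(N²) → O(N) reduction in both time and memory.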
We investigate four previously unexplored aspects of ensemble selection, a procedure for building ensembles of classifiers. First, we test whether adjusting model predictions to put them on a canonical scale makes the ensembles more effective. Second, we explore the performance of ensemble selection when different amounts of data are available for ensemble hillclimbing. Third, we quantify the benefit of ensemble selection's ability to optimize to arbitrary metrics. Fourth, we study the performance impact of pruning the number of models available for ensemble selection. Based on our results, we present improved ensemble selection methods that double the benefit of the original method.
{"title":"Getting the Most Out of Ensemble Selection","authors":"R. Caruana, Art Munson, Alexandru Niculescu-Mizil","doi":"10.1109/ICDM.2006.76","DOIUrl":"https://doi.org/10.1109/ICDM.2006.76","url":null,"abstract":"We investigate four previously unexplored aspects of ensemble selection, a procedure for building ensembles of classifiers. First we test whether adjusting model predictions to put them on a canonical scale makes the ensembles more effective. Second, we explore the performance of ensemble selection when different amounts of data are available for ensemble hillclimbing. Third, we quantify the benefit of ensemble selection's ability to optimize to arbitrary metrics. Fourth, we study the performance impact of pruning the number of models available for ensemble selection. Based on our results we present improved ensemble selection methods that double the benefit of the original method.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122377498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
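For readers unfamiliar with the base procedure being tuned here, greedy forward ensemble selection can be sketched as follows: repeatedly add (with replacement) the model whose inclusion most improves the chosen metric on a held-out hillclimbing set. The simulated models and the fixed step count are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_points = 8, 200
y = rng.integers(0, 2, n_points)             # hillclimbing-set labels
# Simulated model probabilities: label plus noise of varying magnitude.
preds = np.clip(y + rng.normal(0, np.linspace(0.2, 1.0, n_models)[:, None],
                               (n_models, n_points)), 0, 1)

def accuracy(p):
    # The metric being hillclimbed; ensemble selection can plug in any metric.
    return float(np.mean((p > 0.5) == y))

ensemble = []                 # indices; duplicates allowed (selection w/ replacement)
running = np.zeros(n_points)  # sum of selected models' predictions
for _ in range(10):           # fixed number of greedy steps for the sketch
    scores = [accuracy((running + preds[m]) / (len(ensemble) + 1))
              for m in range(n_models)]
    best = int(np.argmax(scores))
    ensemble.append(best)
    running += preds[best]

final_acc = accuracy(running / len(ensemble))
print(len(ensemble), final_acc)
```

The paper's four questions (calibrated predictions, hillclimb-set size, metric choice, model-library pruning) are all knobs on this loop: the scale of `preds`, the size of `y`, the body of `accuracy`, and the range of `m` respectively.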
Object identification aims to identify different representations of the same object based on noisy attributes, such as descriptions of the same product in different online shops or references to the same paper in different publications. Numerous solutions have been proposed for this task, almost all based on similarity functions over pairs of objects. Although such similarity functions are now typically learned from labeled training data, the structural information in the labeled data goes unused. By formulating a generic model for object identification, we show how almost any existing identification model can easily be extended to satisfy structural constraints. We therefore propose a model that, in addition to a learned similarity measure, uses structural information given as pairwise constraints to guide collective identification decisions. Empirical experiments on public and real-life data show that combining structural information with attribute-based similarity substantially increases overall performance on object identification tasks.
{"title":"Object Identification with Constraints","authors":"Steffen Rendle, L. Schmidt-Thieme","doi":"10.1109/ICDM.2006.117","DOIUrl":"https://doi.org/10.1109/ICDM.2006.117","url":null,"abstract":"Object identification aims at identifying different representations of the same object based on noisy attributes such as descriptions of the same product in different online shops or references to the same paper in different publications. Numerous solutions have been proposed for solving this task, almost all of them based on similarity functions of a pair of objects. Although today the similarity functions are learned from a set of labeled training data, the structural information given by the labeled data is not used. By formulating a generic model for object identification we show how almost any proposed identification model can easily be extended for satisfying structural constraints. Therefore we propose a model that uses structural information given as pairwise constraints to guide collective decisions about object identification in addition to a learned similarity measure. We show with empirical experiments on public and on real-life data that combining both structural information and attribute-based similarity enormously increases the overall performance for object identification tasks.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129233534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
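The simplest structural constraint in this setting is transitivity: if record A matches B and B matches C, then A must match C. A minimal sketch of enforcing it on top of pairwise scores uses union-find; the records, scores, and threshold below are invented, and the paper's model is considerably richer:

```python
records = ["iPod nano 4GB", "Apple iPod Nano 4 GB", "iPod shuffle", "Canon EOS 350D"]
pair_scores = {                      # "learned" similarity, here hand-set
    (0, 1): 0.92, (0, 2): 0.55, (1, 2): 0.48,
    (0, 3): 0.05, (1, 3): 0.04, (2, 3): 0.03,
}
THRESHOLD = 0.8

parent = list(range(len(records)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path halving
        x = parent[x]
    return x
def union(a, b):
    parent[find(a)] = find(b)

# Merge pairs above threshold, highest score first; union-find makes the
# decision collective: once 0 and 1 are merged, every later decision
# sees them as a single object.
for (i, j), s in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
    if s >= THRESHOLD:
        union(i, j)

clusters = {}
for i in range(len(records)):
    clusters.setdefault(find(i), []).append(i)
print(sorted(map(sorted, clusters.values())))
```

Here only the two iPod nano variants are merged; transitive closure guarantees the output is a proper partition rather than an inconsistent set of pairwise match decisions.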
Peng Wang, Haixun Wang, Wei Wang, Baile Shi, Philip S. Yu
An avalanche of data arriving in stream form is overstretching our ability to analyze it. In this paper, we propose a novel load shedding method that enables fast and accurate stream data classification. We transform the input data so that its class information is concentrated in a few features, and we introduce a progressive classifier that makes predictions from partial input. We exploit stream data's temporal locality for load shedding; for example, readings from a temperature sensor usually do not change dramatically over a short period of time. We first show that temporal locality in the original data is preserved by our transform; we then utilize positive and negative knowledge about the data (which is much smaller than the data itself) for classification. Both analytical and empirical analyses demonstrate the advantage of our approach.
{"title":"LOCI: Load Shedding through Class-Preserving Data Acquisition","authors":"Peng Wang, Haixun Wang, Wei Wang, Baile Shi, Philip S. Yu","doi":"10.1109/ICDM.2006.100","DOIUrl":"https://doi.org/10.1109/ICDM.2006.100","url":null,"abstract":"An avalanche of data available in the stream form is overstretching our data analyzing ability. In this paper, we propose a novel load shedding method that enables fast and accurate stream data classification. We transform input data so that its class information concentrates on a few features, and we introduce a progressive classifier that makes prediction with partial input. We take advantage of stream data's temporal locality -for example, readings from a temperature sensor usually do not change dramatically over a short period of time -for load shedding. We first show that temporal locality of the original data is preserved by our transform, then we utilize positive and negative knowledge about the data (which is of much smaller size than the data itself) for classification. We employ both analytical and empirical analysis to demonstrate the advantage of our approach.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128656062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
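The core idea, concentrating class information in the leading features so a classifier can predict from however many features the time budget allowed, can be illustrated with a crude stand-in transform: rank features by correlation with the label and classify with nearest centroids over a prefix of that ranking. The paper's actual transform and classifier differ; everything below is an assumption of the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 300, 10
X = rng.normal(size=(n, d))
y = (X[:, 7] + 0.1 * rng.normal(size=n) > 0).astype(int)  # only feature 7 informative

# Crude "transform": rank features by |correlation with the label|.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
order = np.argsort(-corr)

def progressive_predict(x, budget):
    """Nearest-centroid prediction using only the top-`budget` ranked features."""
    feats = order[:budget]
    c0 = X[y == 0][:, feats].mean(axis=0)
    c1 = X[y == 1][:, feats].mean(axis=0)
    return int(np.linalg.norm(x[feats] - c1) < np.linalg.norm(x[feats] - c0))

# Under heavy load we read only 1 feature; with spare capacity, all 10.
acc_1 = np.mean([progressive_predict(X[i], 1) == y[i] for i in range(n)])
acc_all = np.mean([progressive_predict(X[i], d) == y[i] for i in range(n)])
print(acc_1, acc_all)
```

Because the single informative feature is ranked first, shedding nine tenths of the input costs almost no accuracy, which is the behavior the abstract's progressive classifier is designed to achieve.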
The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Yet not all data mining methods produce such readily understandable knowledge; for example, most clustering algorithms output clusters simply as sets of points. In this paper, we perform a systematic study of cluster description, which generates interpretable patterns from clusters. We introduce and analyze novel description formats that offer more expressive power, and we motivate and define novel description problems that specify different trade-offs between interpretability and accuracy. We also present effective heuristic algorithms together with their empirical evaluations.
{"title":"Turning Clusters into Patterns: Rectangle-Based Discriminative Data Description","authors":"Byron J. Gao, M. Ester","doi":"10.1109/ICDM.2006.163","DOIUrl":"https://doi.org/10.1109/ICDM.2006.163","url":null,"abstract":"The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Yet not all data mining methods produce such readily understandable knowledge, e.g., most clustering algorithms output sets of points as clusters. In this paper, we perform a systematic study of cluster description that generates interpretable patterns from clusters. We introduce and analyze novel description formats leading to more expressive power, motivate and define novel description problems specifying different trade-offs between interpretability and accuracy. We also present effective heuristic algorithms together with their empirical evaluations.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128742619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
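In the simplest rectangle-based format, a cluster is described by the tightest axis-parallel bounding box over its points, rendered as a conjunctive rule. The paper studies far richer multi-rectangle formats and their interpretability/accuracy trade-offs; this sketch, with invented data and feature names, shows only the baseline idea:

```python
import numpy as np

cluster = np.array([[1.0, 4.0], [1.5, 5.0], [2.0, 4.2]])
features = ["age_scaled", "income_scaled"]

def describe(points, names):
    # Tightest single bounding rectangle, rendered as an interpretable rule.
    lo, hi = points.min(axis=0), points.max(axis=0)
    return " AND ".join(f"{lo[j]:g} <= {names[j]} <= {hi[j]:g}"
                        for j in range(points.shape[1]))

rule = describe(cluster, features)
print(rule)
```

The output is a pattern an end-user can read directly, in contrast to the raw point set; unions of several such rectangles trade rule length for tightness of fit.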
A document is often full of class-independent "general" words and short of class-specific "core" words, which makes document clustering difficult. We argue that both problems are alleviated by suitable smoothing of document models in agglomerative approaches and of cluster models in partitional approaches, thereby improving clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probabilities, while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of "general" words. Inspired by statistical translation language models for text retrieval, we propose a novel smoothing method, referred to as context-sensitive semantic smoothing, for document clustering. Comparative experiments on three datasets show that model-based clustering with semantic smoothing is effective in improving cluster quality.
{"title":"Semantic Smoothing for Model-based Document Clustering","authors":"Xiaodan Zhang, Xiaohua Zhou, Xiaohua Hu","doi":"10.1109/ICDM.2006.142","DOIUrl":"https://doi.org/10.1109/ICDM.2006.142","url":null,"abstract":"A document is often full of class-independent \"general\" words and short of class-specific \"core \" words, which leads to the difficulty of document clustering. We argue that both problems will be relieved after suitable smoothing of document models in agglomerative approaches and of cluster models in partitional approaches, and hence improve clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probability while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of \"general\" words. Inspired by a series of statistical translation language model for text retrieval, we propose in this paper a novel smoothing method referred to as context-sensitive semantic smoothing for document clustering purpose. The comparative experiment on three datasets shows that model-based clustering approaches with semantic smoothing is effective in improving cluster quality.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"373 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129081431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
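The smoothing skeleton underlying model-based clustering can be sketched with plain Jelinek-Mercer interpolation of a cluster language model with a corpus background model. The paper's context-sensitive semantic smoothing additionally translates multiword topic signatures; the toy corpus and the interpolation weight here are assumptions:

```python
import math
from collections import Counter

docs = [["stock", "market", "trade"], ["market", "price", "stock"],
        ["gene", "protein", "cell"]]
cluster_docs = docs[:2]                       # an assumed "finance" cluster

corpus = Counter(w for d in docs for w in d)
cluster = Counter(w for d in cluster_docs for w in d)
n_corpus, n_cluster = sum(corpus.values()), sum(cluster.values())
lam = 0.5

def p_smoothed(w):
    p_ml = cluster[w] / n_cluster             # maximum-likelihood cluster model
    p_bg = corpus[w] / n_corpus               # corpus background model
    return (1 - lam) * p_ml + lam * p_bg      # Jelinek-Mercer interpolation

# "gene" never occurs in the cluster, yet it gets nonzero probability, so
# the log-likelihood of any corpus document under this cluster stays finite.
loglik = sum(math.log(p_smoothed(w)) for w in ["stock", "gene"])
print(p_smoothed("gene") > 0, loglik)
```

Without smoothing, assigning a document containing "gene" to the finance cluster would score negative infinity; with it, cluster assignment degrades gracefully, and semantic smoothing sharpens this further by discounting "general" words in favor of "core" words.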
Choosing an appropriate kernel is one of the key problems in kernel-based methods. Most existing kernel selection methods require that the class labels of the training examples are known. In this paper, we propose an adaptive kernel selection method for kernel principal component analysis, which can effectively learn the kernels when the class labels of the training examples are not available. By iteratively optimizing a novel criterion, the proposed method achieves nonlinear feature extraction and unsupervised kernel learning simultaneously. Moreover, a non-iterative approximate algorithm is developed. The effectiveness of the proposed algorithms is validated on UCI datasets and the COIL-20 object recognition database.
{"title":"Adaptive Kernel Principal Component Analysis with Unsupervised Learning of Kernels","authors":"Daoqiang Zhang, Zhi-Hua Zhou, Songcan Chen","doi":"10.1109/ICDM.2006.14","DOIUrl":"https://doi.org/10.1109/ICDM.2006.14","url":null,"abstract":"Choosing an appropriate kernel is one of the key problems in kernel-based methods. Most existing kernel selection methods require that the class labels of the training examples are known. In this paper, we propose an adaptive kernel selection method for kernel principal component analysis, which can effectively learn the kernels when the class labels of the training examples are not available. By iteratively optimizing a novel criterion, the proposed method can achieve nonlinear feature extraction and unsupervised kernel learning simultaneously. Moreover, a non-iterative approximate algorithm is developed. The effectiveness of the proposed algorithms are validated on UCI datasets and the COIL-20 object recognition database.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116845047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
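The building block the adaptive method repeatedly re-optimizes is an ordinary kernel PCA step, which can be sketched as below. The kernel-learning criterion itself is omitted; the RBF kernel and its fixed bandwidth are just assumptions standing in for whatever the learned kernel would be:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
gamma = 0.5

# RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

# Center the kernel matrix in feature space: K' = K - 1K - K1 + 1K1.
n = K.shape[0]
one = np.full((n, n), 1.0 / n)
Kc = K - one @ K - K @ one + one @ K @ one

# Nonlinear features = top eigenvectors scaled by sqrt(eigenvalue).
w, V = np.linalg.eigh(Kc)
idx = np.argsort(w)[::-1][:2]
features = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
print(features.shape)
```

Iterating between a step like this and an update of the kernel parameters against an unsupervised criterion is, in outline, how feature extraction and kernel learning can proceed simultaneously without class labels.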
Manual debugging is expensive, and its high cost has motivated extensive research on automated fault localization in both the software engineering and data mining communities. Fault localization automatically locates likely fault positions and hence assists manual debugging. A number of fault localization algorithms have been developed in recent years, and they prove effective when multiple failing and passing cases are available. However, we observe that what is more commonly encountered in practice is the two-sample debugging problem, where only one failing case and one passing case are available. This problem has been either overlooked or insufficiently tackled in previous studies. In this paper, we develop a new fault localization algorithm, named BayesDebug, which simulates manual debugging principles through a Bayesian approach. Unlike existing approaches that base fault analysis on multiple passing and failing cases, BayesDebug requires only one passing case and one failing case. We reason about why BayesDebug fits the two-sample debugging problem and why other approaches do not. Finally, an experiment with the real-world program grep-2.2 exemplifies the effectiveness of BayesDebug.
{"title":"How Bayesians Debug","authors":"Chao Liu, Zeng Lian, Jiawei Han","doi":"10.1109/ICDM.2006.83","DOIUrl":"https://doi.org/10.1109/ICDM.2006.83","url":null,"abstract":"Manual debugging is expensive. And the high cost has motivated extensive research on automated fault localization in both software engineering and data mining communities. Fault localization aims at automatically locating likely fault locations, and hence assists manual debugging. A number of fault localization algorithms have been developed in recent years, which prove effective when multiple failing and passing cases are available. However, we notice what is more commonly encountered in practice is the two-sample debugging problem, where only one failing and one passing cases are available. This problem has been either overlooked or insufficiently tackled in previous studies. In this paper, we develop a new fault localization algorithm, named BayesDebug, which simulates some manual debugging principles through a Bayesian approach. Different from existing approaches that base fault analysis on multiple passing and failing cases, BayesDebug only requires one passing and one failing cases. We reason about why BayesDebug fits the two-sample debugging problem and why other approaches do not. Finally, an experiment with a real-world program grep-2.2 is conducted, which exemplifies the effectiveness of BayesDebug.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115666617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
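To make the two-sample setting concrete, here is a deliberately simple stand-in (not the paper's Bayesian model): given coverage from exactly one passing and one failing run, statements executed only by the failing run are the first suspects. The trace contents are invented:

```python
# Statement coverage from a single passing and a single failing run
# (identifiers are hypothetical "function:line" labels).
passing_trace = {"main:1", "parse:4", "parse:5", "emit:9"}
failing_trace = {"main:1", "parse:4", "parse:6", "parse:7", "emit:9"}

# With only two samples, set difference is the crudest possible ranking:
# failing-only statements first, shared statements lower.
suspicious = sorted(failing_trace - passing_trace)
shared = sorted(failing_trace & passing_trace)
print(suspicious)
```

Statistical fault localizers need many runs to estimate per-statement suspiciousness; with one sample of each kind those estimates degenerate, which is the gap the abstract argues a Bayesian treatment of debugging principles can fill.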