
Latest publications from the Sixth International Conference on Data Mining (ICDM'06)

Boosting the Feature Space: Text Classification for Unstructured Data on the Web
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.31
Yang Song, Ding Zhou, Jian Huang, Isaac G. Councill, H. Zha, C. Lee Giles
The issue of seeking efficient and effective methods for classifying unstructured text in large document corpora has received much attention in recent years. Traditional document representations like bag-of-words encode documents as feature vectors, which usually leads to sparse feature spaces with large dimensionality, making it hard to achieve high classification accuracy. This paper addresses the problem of classifying unstructured documents on the Web. A classification approach is proposed that utilizes traditional feature reduction techniques along with a collaborative filtering method for augmenting document feature spaces. The method produces feature spaces with an order of magnitude fewer features than a baseline bag-of-words feature selection method. Experiments on both real-world data and a benchmark corpus indicate that our approach improves classification accuracy over traditional methods for both support vector machines and AdaBoost classifiers.
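The sparsity problem the abstract starts from is easy to see with a toy bag-of-words encoder (a generic illustration, not the authors' pipeline; the sample documents are invented):

```python
from collections import Counter

def bag_of_words(docs):
    """Encode documents as term-count vectors over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w, c in Counter(d.lower().split()).items():
            v[index[w]] = c
        vectors.append(v)
    return vocab, vectors

docs = ["data mining on the web", "text classification of web data"]
vocab, vecs = bag_of_words(docs)
# Fraction of zero entries: already over a third for two tiny documents.
sparsity = sum(x == 0 for v in vecs for x in v) / (len(vecs) * len(vocab))
```

With a realistic vocabulary of tens of thousands of terms, almost every entry is zero; that sparsity is what feature reduction and feature-space augmentation aim to counter.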
Citations: 17
Stability Region Based Expectation Maximization for Model-based Clustering
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.152
C. Reddy, H. Chiang, B. Rajaratnam
In spite of the initialization problem, the expectation-maximization (EM) algorithm is widely used for estimating the parameters in several data mining related tasks. Most popular model-based clustering techniques might yield poor clusters if the parameters are not initialized properly. To reduce this sensitivity to initial points, a novel algorithm for learning mixture models from multivariate data is introduced in this paper. The proposed algorithm takes advantage of TRUST-TECH (TRansformation Under STability- reTaining Equilibra CHaracterization) to compute neighborhood local maxima on the likelihood surface using stability regions. Essentially, our method combines the advantages of traditional EM with the dynamic and geometric characteristics of the stability regions of the nonlinear dynamical system corresponding to the log-likelihood function. Two phases, namely the EM phase and the stability-region phase, are repeated alternately in the parameter space to achieve improvements in the maximum likelihood. Though applied to Gaussian mixtures in this paper, our technique can be easily generalized to any other parametric finite mixture model. The algorithm has been tested on both synthetic and real datasets, and improvements in performance compared to other approaches are demonstrated. Robustness with respect to initialization is also illustrated experimentally.
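The EM phase alone can be sketched for a one-dimensional, two-component Gaussian mixture (a standard EM step under invented data; the TRUST-TECH stability-region phase is not reproduced here):

```python
import math

def em_step(data, means, variances, weights):
    """One EM iteration for a 1-D Gaussian mixture (the 'EM phase' only)."""
    k = len(means)
    # E-step: responsibilities resp[i][j] = P(component j | x_i)
    resp = []
    for x in data:
        dens = [weights[j] / math.sqrt(2 * math.pi * variances[j])
                * math.exp(-(x - means[j]) ** 2 / (2 * variances[j]))
                for j in range(k)]
        s = sum(dens)
        resp.append([d / s for d in dens])
    # M-step: re-estimate parameters from the soft assignments
    n = len(data)
    nk = [sum(r[j] for r in resp) for j in range(k)]
    means = [sum(r[j] * x for r, x in zip(resp, data)) / nk[j] for j in range(k)]
    variances = [max(1e-6, sum(r[j] * (x - means[j]) ** 2
                 for r, x in zip(resp, data)) / nk[j]) for j in range(k)]
    weights = [nk[j] / n for j in range(k)]
    return means, variances, weights

data = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]
m, v, w = [0.0, 5.0], [1.0, 1.0], [0.5, 0.5]
for _ in range(20):
    m, v, w = em_step(data, m, v, w)
```

Each call performs one E-step/M-step pair; the paper alternates runs of such steps with stability-region moves that escape poor local maxima of the likelihood.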
Citations: 4
Fast Relevance Discovery in Time Series
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.71
Chang-Shing Perng, Haixun Wang, Sheng Ma
In this paper, we propose to model time series from a new angle: state transition points. When fluctuation of values in a time series crosses a certain point, it may trigger state transition in the system, which may lead to abrupt changes in many other time series. The concept of state transition points is essential in understanding the behavior of the time series and the behavior of the system. The new measure is robust and is capable of discovering correlations that Pearson's coefficient cannot reveal. We propose efficient algorithms to identify state transition points and to compute correlation between two time series. We also introduce some triangular inequalities to efficiently find highly correlated time series among many time series.
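A minimal reading of the state-transition-point idea (threshold and data invented; the paper's actual correlation measure and triangular inequalities are more involved):

```python
def transition_points(series, threshold):
    """Indices where the series crosses the threshold, i.e. changes state."""
    states = [v > threshold for v in series]
    return [i for i in range(1, len(states)) if states[i] != states[i - 1]]

a = [1, 1, 9, 9, 1, 1, 8, 8]
b = [2, 2, 7, 7, 2, 2, 9, 9]
# The two series never take the same values, yet they switch state
# at identical times - a transition-based notion of relatedness.
ta, tb = transition_points(a, 5), transition_points(b, 5)
```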
Citations: 4
Resource Management for Networked Classifiers in Distributed Stream Mining Systems
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.136
D. Turaga, O. Verscheure, U. Chaudhari, Lisa Amini
Networks of classifiers are capturing the attention of system and algorithmic researchers because they offer improved accuracy over single model classifiers, can be distributed over a network of servers for improved scalability, and can be adapted to available system resources. This work provides a principled approach for the optimized allocation of system resources across a networked chain of classifiers. We begin with an illustrative example of how complex classification tasks can be decomposed into a network of binary classifiers. We formally define a global performance metric by recursively collapsing the chain of classifiers into one combined classifier. The performance metric trades off the end-to-end probabilities of detection and false alarm, both of which depend on the resources allocated to each individual classifier. We formulate the optimization problem and present optimal resource allocation results for both simulated and state-of-the-art classifier chains operating on telephony data.
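As a rough sketch of the collapse step: if one assumes an item is reported only when every stage in the chain flags it, the end-to-end detection and false-alarm probabilities are simply products over the stages (a simplification of the paper's recursive collapse):

```python
from math import prod

def collapse_chain(stages):
    """Collapse a chain of binary classifiers into one equivalent classifier,
    assuming an item is forwarded only when flagged at every stage.
    Each stage is a pair (P_detect, P_false_alarm)."""
    p_d = prod(pd for pd, _ in stages)
    p_f = prod(pf for _, pf in stages)
    return p_d, p_f

p_d, p_f = collapse_chain([(0.95, 0.10), (0.90, 0.05)])
```

Chaining lowers both probabilities at once, which is the detection/false-alarm trade-off the resource allocation has to balance per classifier.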
Citations: 23
Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.21
Ken Ueno, X. Xi, Eamonn J. Keogh, Dah-Jye Lee
For many real world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm may be especially useful. In this work we show how we can convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification, or if given the luxury of additional time, can utilize the extra time to increase classification accuracy. We demonstrate the utility of our approach with a comprehensive set of experiments on data from diverse domains.
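The anytime behavior can be sketched with a budgeted linear scan that always has an answer ready (illustrative only; among other refinements, the paper also orders the training instances so the most useful ones are examined first):

```python
import time

def anytime_1nn(query, train, labels, budget_s):
    """Anytime nearest neighbor: scan training instances until the time
    budget expires, returning the best-so-far label. More time means more
    instances examined and hence a better answer."""
    deadline = time.monotonic() + budget_s
    best_dist, best_label = float("inf"), labels[0]  # instant fallback
    for x, y in zip(train, labels):
        d = sum((a - b) ** 2 for a, b in zip(query, x))
        if d < best_dist:
            best_dist, best_label = d, y
        if time.monotonic() >= deadline:
            break  # interrupted: return the current best
    return best_label

train = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1)]
labels = ["a", "b", "a"]
label = anytime_1nn((0.1, 0.1), train, labels, budget_s=0.01)
```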
Citations: 115
TOP-COP: Mining TOP-K Strongly Correlated Pairs in Large Databases
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.161
Hui Xiong, Mark Brodie, Sheng Ma
Recently, there has been considerable interest in computing strongly correlated pairs in large databases. Most previous studies require the specification of a minimum correlation threshold to perform the computation. However, it may be difficult for users to provide an appropriate threshold in practice, since different data sets typically have different characteristics. To this end, we propose an alternative task: mining the top-k strongly correlated pairs. In this paper, we identify a 2-D monotone property of an upper bound of Pearson's correlation coefficient and develop an efficient algorithm, called TOP-COP to exploit this property to effectively prune many pairs even without computing their correlation coefficients. Our experimental results show that the TOP-COP algorithm can be orders of magnitude faster than brute-force alternatives for mining the top-k strongly correlated pairs.
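For two binary items with supports sa ≥ sb, Pearson's φ coefficient is largest when the pair's joint support reaches its maximum possible value sb, which gives the bound sqrt((sb/sa)·((1−sa)/(1−sb))). A sketch of that bound (illustrative; the full TOP-COP search over the 2-D monotone property is not shown):

```python
import math

def phi_upper_bound(sa, sb):
    """Upper bound on Pearson's phi for two binary items, reached when
    the pair's joint support is maximal (= the smaller support)."""
    if sa < sb:
        sa, sb = sb, sa  # ensure sa >= sb
    return math.sqrt((sb / sa) * ((1 - sa) / (1 - sb)))

# Pruning: a pair whose bound is below the current top-k threshold can
# be skipped without computing its exact correlation from the data.
bound = phi_upper_bound(0.8, 0.1)
```

Because the bound depends only on the two supports, it costs almost nothing per pair, which is what makes orders-of-magnitude pruning possible.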
Citations: 33
Using an Ensemble of One-Class SVM Classifiers to Harden Payload-based Anomaly Detection Systems
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.165
R. Perdisci, G. Gu, Wenke Lee
Unsupervised or unlabeled learning approaches for network anomaly detection have been recently proposed. In particular, recent work on unlabeled anomaly detection has focused on high-speed classification based on simple payload statistics. For example, PAYL, an anomaly IDS, measures the occurrence frequency of n-grams in the packet payload. A simple model of normal traffic is then constructed according to this description of the packets' content. It has been demonstrated that anomaly detectors based on payload statistics can be "evaded" by mimicry attacks using byte substitution and padding techniques. In this paper we propose a new approach to constructing high-speed payload-based anomaly IDS intended to be accurate and hard to evade. We propose a new technique to extract features from the payload, and we use a feature clustering algorithm originally proposed for text classification problems to reduce the dimensionality of the feature space. Accuracy and hardness of evasion are obtained by constructing our anomaly-based IDS using an ensemble of one-class SVM classifiers that work on different feature spaces.
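A PAYL-style payload statistic can be sketched as an n-gram frequency histogram over raw bytes (a generic illustration; the example payload is invented and the paper's feature clustering is not shown):

```python
from collections import Counter

def ngram_freq(payload: bytes, n: int = 1):
    """Relative frequency of n-grams in a packet payload - the kind of
    simple statistic a PAYL-style detector models for 'normal' traffic."""
    grams = [payload[i:i + n] for i in range(len(payload) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

profile = ngram_freq(b"GET /index.html", n=1)
```

A detector would compare each observed packet's histogram against the normal profile and flag large deviations; mimicry attacks work precisely by shaping malicious payloads to match such histograms.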
Citations: 243
GraphRank: Statistical Modeling and Mining of Significant Subgraphs in the Feature Space
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.79
Huahai He, Ambuj K. Singh
We propose a technique for evaluating the statistical significance of frequent subgraphs in a database. A graph is represented by a feature vector that is a histogram over a set of basis elements. The set of basis elements is chosen based on domain knowledge and consists generally of vertices, edges, or small graphs. A given subgraph is transformed to a feature vector and the significance of the subgraph is computed by considering the significance of occurrence of the corresponding vector. The probability of occurrence of the vector in a random vector is computed based on the prior probability of the basis elements. This is then used to obtain a probability distribution on the support of the vector in a database of random vectors. The statistical significance of the vector/subgraph is then defined as the p-value of its observed support. We develop efficient methods for computing p-values and lower bounds. A simplified model is further proposed to improve the efficiency. We also address the problem of feature vector mining, a generalization of item- set mining where counts are associated with items and the goal is to find significant sub-vectors. We present an algorithm that explores closed frequent sub-vectors to find significant ones. Experimental results show that the proposed techniques are effective, efficient, and useful for ranking frequent subgraphs by their statistical significance.
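The p-value-of-support notion can be illustrated with a simple binomial tail (a generic stand-in; the paper derives the occurrence probability from the prior probabilities of the basis elements rather than assuming a fixed p):

```python
from math import comb

def support_p_value(n, p, observed):
    """P(support >= observed) when each of n random vectors contains the
    feature vector independently with probability p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(observed, n + 1))

# A vector expected in about 1% of 1000 random vectors but observed
# 25 times gets a tiny p-value, i.e. high statistical significance.
pval = support_p_value(1000, 0.01, 25)
```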
Citations: 58
Temporal Data Mining in Dynamic Feature Spaces
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.157
B. Wenerstrom, C. Giraud-Carrier
Many interesting real-world applications for temporal data mining are hindered by concept drift. One particular form of concept drift is characterized by changes to the underlying feature space. Seemingly little has been done in this area. This paper presents FAE, an incremental ensemble approach to mining data subject to such concept drift. Empirical results on large data streams demonstrate promise.
Citations: 43
Speedup Clustering with Hierarchical Ranking
Pub Date: 2006-12-18 | DOI: 10.1109/ICDM.2006.151
Jianjun Zhou, J. Sander
Many clustering algorithms, in particular hierarchical ones, do not scale up well to large data sets, especially when an expensive distance function is used. In this paper, we propose a novel approach to perform approximate clustering with high accuracy. We introduce the concept of a pairwise hierarchical ranking to efficiently determine close neighbors for every data object. Empirical results on synthetic and real-life data show a speedup of up to two orders of magnitude over OPTICS while maintaining high accuracy, and up to one order of magnitude over the previously proposed DATA BUBBLES method, which also tries to speed up OPTICS by trading accuracy for speed.
Citations: 4