Mining Maximal Quasi-Bicliques to Co-Cluster Stocks and Financial Ratios for Value Investment
Kelvin Sim, Jinyan Li, V. Gopalkrishnan, Guimei Liu
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.111

We introduce an unsupervised process to co-cluster groups of stocks and financial ratios, so that investors can gain more insight into how they are correlated. Our co-clustering is based on a graph concept called maximal quasi-bicliques, which can tolerate the erroneous and/or missing information that is common in stock and financial ratio data. In contrast to previous work, our maximal quasi-bicliques require the errors to be evenly distributed, which enables us to capture more meaningful co-clusters. We develop a new algorithm that efficiently enumerates maximal quasi-bicliques from an undirected graph. The concept of maximal quasi-bicliques is domain-independent; it can be extended to co-cluster any data that can be modeled as a graph.
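The error-tolerance condition can be made concrete with a small membership check: a pair of vertex sets (A, B) forms a quasi-biclique with evenly distributed errors when every vertex misses at most a bounded number of edges to the opposite side. The sketch below is illustrative only; the adjacency representation and the per-vertex error bound `eps` are assumptions, not the paper's exact definition.

```python
def is_quasi_biclique(adj, A, B, eps):
    """Check whether (A, B) is a quasi-biclique in which every vertex
    misses at most eps edges to the opposite side (errors evenly spread).
    adj maps each vertex to its set of neighbours."""
    for a in A:
        if len(B - adj[a]) > eps:   # edges from a to B that are missing
            return False
    for b in B:
        if len(A - adj[b]) > eps:   # edges from b to A that are missing
            return False
    return True
```

An enumeration algorithm such as the paper's would use a check like this as its pruning condition while growing candidate vertex sets.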
Corrective Classification: Classifier Ensembling with Corrective and Diverse Base Learners
Yan Zhang, Xingquan Zhu, Xindong Wu
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.45

Empirical studies on supervised learning have shown that ensembling methods produce a model superior to one built from a single learner in many circumstances, especially when learning from imperfect information sources such as biased or noise-infected data. In this paper, we present a novel corrective classification (C2) design, which incorporates error detection, data cleansing, and bootstrap sampling to construct the base learners that constitute the classifier ensemble. The essential goal is to reduce the impact of noise and ultimately enhance the learners built from noise-corrupted data. We further analyze the importance of both the accuracy and the diversity of base learners in ensembling, to shed light on the mechanism by which C2 works. Experimental comparisons demonstrate that C2 is not only superior to the learner built from the original noisy sources, but also more reliable than bagging or the aggressive classifier ensemble (ACE), which are two degenerate components/variants of C2.
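The pipeline described above (error detection, cleansing, bootstrap sampling, then voting) can be sketched as a generic skeleton. The `train_base` and `detect_noise` callables are hypothetical placeholders standing in for the paper's actual components, not its published procedure.

```python
import random
from collections import Counter

def c2_ensemble(data, train_base, detect_noise, n_learners=5, seed=0):
    """Corrective-classification-style ensemble sketch: each base learner
    is trained on a bootstrap sample from which suspected noisy records
    have been filtered out; prediction is by majority vote."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]           # bootstrap
        cleaned = [rec for rec in sample
                   if not detect_noise(rec, sample)]        # cleansing
        learners.append(train_base(cleaned))
    def predict(x):                                         # majority vote
        votes = Counter(clf(x) for clf in learners)
        return votes.most_common(1)[0][0]
    return predict
```

Bagging falls out as the special case where `detect_noise` never flags anything, which matches the paper's description of bagging as a degenerate variant of C2.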
Bregman Bubble Clustering: A Robust, Scalable Framework for Locating Multiple, Dense Regions in Data
Gunjan Gupta, Joydeep Ghosh
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.32

In traditional clustering, every data point is assigned to at least one cluster. At the other extreme, recently proposed one-class clustering algorithms identify a single dense cluster and consider the rest of the data irrelevant. However, in many problems the relevant data forms multiple natural clusters. In this paper, we introduce the notion of Bregman bubbles and propose Bregman bubble clustering (BBC), which seeks k dense Bregman bubbles in the data. We also present a corresponding generative model, soft BBC, and show several connections with Bregman clustering and with a one-class clustering algorithm. Empirical results on various datasets show the effectiveness of our method.
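For intuition, with squared Euclidean distance as the Bregman divergence, seeking k dense bubbles can be pictured as a k-means-like loop that retains only the densest fraction of points. This is an illustrative simplification under that assumption, not the paper's exact BBC or soft BBC algorithm.

```python
import numpy as np

def bregman_bubble(X, k, keep_frac=0.5, n_iter=20, seed=0):
    """Toy sketch: alternately assign points to their nearest centroid,
    retain only the keep_frac fraction with the smallest divergence
    (the dense bubbles), and update centroids from retained points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # n x k
        assign = d.argmin(1)
        best = d.min(1)
        cutoff = np.quantile(best, keep_frac)
        mask = best <= cutoff                    # points inside bubbles
        for j in range(k):
            members = X[mask & (assign == j)]
            if len(members):
                centers[j] = members.mean(0)
    return centers, assign, mask
```

Points outside the bubbles (`mask == False`) are treated as irrelevant "don't care" data, which is what distinguishes this setting from exhaustive clustering.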
Recommendation on Item Graphs
Fei Wang, Shengchao Ma, Liuzhong Yang, Ta-Hsin Li
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.133

A novel scheme for item-based recommendation is proposed in this paper. In our framework, the items are described by an undirected weighted graph Q = (V, ε), where the node set V is identical to the item set and ε is the edge set. Associated with each edge e_ij ∈ ε is a weight ω_ij ≥ 0, which represents the similarity between items i and j. Without loss of generality, we assume that any user's ratings of the items should be sufficiently smooth with respect to the intrinsic structure of the items, i.e., a user should give similar ratings to similar items. A simple algorithm is presented to compute such a smooth solution. Encouraging experimental results show the effectiveness of our method.
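One standard way to obtain such a smooth rating vector (a sketch of the smoothness idea, not necessarily the paper's exact algorithm) is graph regularization: minimize ||f - y||² + λ·fᵀLf over the item graph's Laplacian L, which has the closed form f = (I + λL)⁻¹y.

```python
import numpy as np

def smooth_ratings(W, y, lam=1.0):
    """Propagate a user's known ratings over the item graph by minimising
    ||f - y||^2 + lam * f^T L f, i.e. f = (I + lam*L)^{-1} y.
    W: symmetric item-similarity matrix; y: observed ratings (0 = unrated)."""
    L = np.diag(W.sum(1)) - W                  # combinatorial Laplacian
    n = len(W)
    return np.linalg.solve(np.eye(n) + lam * L, y)
```

On a chain of three items where only the first is rated, the solved scores decay monotonically along the chain, so unrated items similar to liked ones are ranked highest.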
A Framework for Regional Association Rule Mining in Spatial Datasets
W. Ding, C. Eick, Jing Wang, Xiaojing Yuan
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.5

The immense explosion of geographically referenced data calls for efficient discovery of spatial knowledge. One of the special challenges for spatial data mining is that information is usually not uniformly distributed in spatial datasets. Consequently, the discovery of regional knowledge is of fundamental importance for spatial data mining. This paper centers on discovering regional association rules in spatial datasets. In particular, we introduce a novel framework to mine regional association rules relying on a given class structure. A reward-based regional discovery methodology is introduced, and a divisive, grid-based supervised clustering algorithm is presented that identifies interesting subregions in spatial datasets. Then, an integrated approach is discussed to systematically mine regional rules. The proposed framework is evaluated in a real-world case study that identifies spatial risk patterns from arsenic in the Texas water supply.
Deploying Approaches for Pattern Refinement in Text Mining
Sheng-Tang Wu, Yuefeng Li, Yue Xu
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.50

Text mining is a technique that helps users find useful information in large collections of digital text documents on the Web or in databases. Instead of the keyword-based approach typically used in this field, a pattern-based model containing frequent sequential patterns is employed to perform the same tasks. However, how to effectively use these discovered patterns remains a significant challenge. In this study, we propose two approaches based on pattern deploying strategies. The performance of the pattern deploying algorithms for text mining is evaluated on the Reuters dataset RCV1, and the results show that effectiveness is improved by our proposed pattern refinement approaches.
Adaptive Blocking: Learning to Scale Up Record Linkage
M. Bilenko, B. Kamath, R. Mooney
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.13

Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
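The predicate-based blocking idea can be illustrated with a small sketch: each predicate maps a record to a key, records sharing a key fall into a common block, and only within-block pairs become candidates. Which disjunction of predicates to use is what the paper's algorithms learn; the predicates in the example below are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, predicates):
    """Generate candidate pairs via blocking: for each predicate, build an
    inverted index from key to record ids, then emit only within-block
    pairs. All other pairs are implicitly treated as dissimilar."""
    candidates = set()
    for pred in predicates:
        blocks = defaultdict(list)
        for i, rec in enumerate(records):
            blocks[pred(rec)].append(i)
        for ids in blocks.values():
            candidates.update(combinations(ids, 2))
    return candidates
```

With n records and small blocks, the candidate set is far smaller than the n·(n-1)/2 exhaustive pairs, which is the quadratic blow-up blocking exists to avoid.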
An Efficient Reference-Based Approach to Outlier Detection in Large Datasets
Yaling Pei, Osmar R. Zaiane, Yong Gao
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.17

A bottleneck in detecting distance- and density-based outliers is that a nearest-neighbor search is required for each data point, resulting in a quadratic number of pairwise distance evaluations. In this paper, we propose a new method that uses the relative degree of density with respect to a fixed set of reference points to approximate the degree of density defined in terms of a data point's nearest neighbors. The running time of our algorithm based on this approximation is O(Rn log n), where n is the size of the dataset and R is the number of reference points. Candidate outliers are ranked based on the outlier score assigned to each data point. Theoretical analysis and empirical studies show that our method is effective, efficient, and highly scalable to very large datasets.
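A simplified sketch of the reference-based idea: project the data onto its distances to each reference point (one O(n log n) sort per reference) and score each point by how sparse its neighborhood is in that one-dimensional ordering. The k-th-nearest-gap scoring rule below is an illustrative stand-in, not the paper's exact relative-density definition.

```python
import numpy as np

def reference_outlier_scores(X, refs, k=2):
    """Score each point by the gap to its k-th nearest neighbour in the
    1-D ordering induced by distance to each reference point; sparse
    points get large gaps. Scores are combined by taking the maximum
    over references (a heuristic choice for this sketch)."""
    n = len(X)
    scores = np.zeros(n)
    for r in refs:
        d = np.linalg.norm(X - r, axis=1)      # distance to this reference
        order = np.argsort(d)                  # O(n log n) per reference
        sd = d[order]
        gaps = np.empty(n)
        for i in range(n):
            lo, hi = max(i - k, 0), min(i + k, n - 1)
            g = sorted(abs(sd[j] - sd[i])
                       for j in range(lo, hi + 1) if j != i)
            gaps[i] = g[min(k, len(g)) - 1]    # k-th nearest 1-D gap
        s = np.empty(n)
        s[order] = gaps                        # map back to original indices
        scores = np.maximum(scores, s)
    return scores
```

Because each point's score depends only on its position in R sorted one-dimensional projections, no pairwise nearest-neighbor search over the full dataset is needed.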
Latent Friend Mining from Blog Data
Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.95

The rapid growth of blog (also known as "weblog") data provides a rich resource for social community mining. In this paper, we put forward a novel research problem: mining the latent friends of bloggers based on the contents of their blog entries. Latent friends are defined here as people who share similar topic distributions in their blogs. These people may not actually know each other, but they have the interest and the potential to find each other. Three approaches are designed for latent friend detection. The first, the cosine similarity-based method, determines the similarity between bloggers by computing the cosine similarity between the contents of their blogs. The second, the topic-based method, discovers latent topics using a latent topic model and then computes similarity at the topic level. The third is a two-level similarity-based method conducted in two stages: in the first stage, an existing topic hierarchy is exploited to build a topic distribution for each blogger; in the second stage, a detailed similarity comparison is conducted for the bloggers found to be close in interest in the first stage. Our experimental results show that both the topic-based and two-level similarity-based methods work well, and that the last approach performs much better than the first two. We also give a detailed analysis of the advantages and disadvantages of the different approaches.
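The cosine similarity at the core of the first two methods reduces to ranking bloggers by the cosine of their content or topic vectors. A minimal sketch, where the toy topic distributions in the usage example are assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic/term weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def latent_friends(topic_dists, query, top_n=3):
    """Rank other bloggers by cosine similarity of their topic
    distributions to the query blogger's distribution."""
    sims = [(cosine(topic_dists[b], topic_dists[query]), b)
            for b in topic_dists if b != query]
    return [b for _, b in sorted(sims, reverse=True)[:top_n]]
```

The two-level method would apply a comparison like this only to bloggers whose coarse topic-hierarchy distributions already place them close to the query blogger.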
High-Performance Unsupervised Relation Extraction from Large Corpora
Binyamin Rosenfeld, Ronen Feldman
Sixth International Conference on Data Mining (ICDM'06), 18 December 2006. doi:10.1109/ICDM.2006.82

We present URIES, an unsupervised relation identification and extraction system. The system automatically identifies interesting binary relations between entities in the input corpus, and then proceeds to extract a large number of instances of these relations. The system discovers relations by clustering frequently co-occurring pairs of entities based on the contexts in which they appear. Its complex pattern-based representation of contexts allows the clustering step to achieve very high precision, sufficient for the clusters to serve as sets of seeds for bootstrapping a high-recall relation extraction process. In a series of experiments, we demonstrate the successful performance of URIES and compare it to two existing systems: a weakly supervised high-recall Web relation extraction system called SRES, and an unsupervised relation identification system that uses a simpler bag-of-words representation of contexts. The experiments show that URIES performs comparably to SRES, but without any supervision, and that this performance is due to the power of its complex context representation and its novel candidate selection method.