2008 IEEE International Conference on Data Mining Workshops: Latest Articles

A Comparative Study of Data Sampling and Cost Sensitive Learning

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.119

Chris Seiffert, T. Khoshgoftaar, J. V. Hulse, Amri Napolitano

Two common challenges that data mining and machine learning practitioners face in many application domains are unequal classification costs and class imbalance. Most traditional data mining techniques attempt to maximize overall accuracy rather than minimize cost. When data is imbalanced, such techniques result in models that highly favor the overrepresented class, which typically carries a lower cost of misclassification. Two techniques that have been used to address both of these issues are cost sensitive learning and data sampling. In this work, we investigate the performance of two cost sensitive learning techniques and four data sampling techniques for minimizing classification costs when data is imbalanced. We present a comprehensive suite of experiments, utilizing 15 datasets with 10 cost ratios, which have been carefully designed to ensure conclusive, significant and reliable results.
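
The abstract does not name the two cost sensitive techniques or the four sampling techniques it compares, so the following is only an illustrative sketch of the two general ideas. It assumes binary labels (1 = rare, costly class), zero cost for correct predictions, and a classifier that already outputs a probability for class 1. The function names are hypothetical. Under those assumptions, predicting class 1 has lower expected cost whenever the probability is at least `cost_fp / (cost_fp + cost_fn)`, and random undersampling simply discards majority-class examples before training.

```python
import random


def cost_threshold(cost_fp, cost_fn):
    """Probability at or above which predicting class 1 has lower expected cost.

    Assumes correct predictions cost nothing: predicting 1 costs (1 - p) * cost_fp
    in expectation, predicting 0 costs p * cost_fn.
    """
    return cost_fp / (cost_fp + cost_fn)


def predict_min_cost(p_positive, cost_fp, cost_fn):
    """Label each example with the class that minimizes expected cost."""
    threshold = cost_threshold(cost_fp, cost_fn)
    return [1 if p >= threshold else 0 for p in p_positive]


def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Sum the misclassification costs over a labeled sample."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 1:
            cost += cost_fp
        elif truth == 1 and pred == 0:
            cost += cost_fn
    return cost


def random_undersample(y, majority_label, n_keep, seed=0):
    """Return indices that keep every minority example and n_keep majority ones."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    kept = set(rng.sample(majority, n_keep))
    return [i for i, label in enumerate(y) if label != majority_label or i in kept]


# Hypothetical classifier outputs for four examples, with a 1:10 cost ratio.
probs = [0.05, 0.2, 0.6, 0.15]
y_true = [0, 1, 1, 0]
accuracy_style = predict_min_cost(probs, 1, 1)   # threshold 0.5
cost_aware = predict_min_cost(probs, 1, 10)      # threshold 1/11
print(total_cost(y_true, accuracy_style, 1, 10))
print(total_cost(y_true, cost_aware, 1, 10))
```

In this toy sample the accuracy-oriented threshold misses one costly positive, while the cost-aware threshold trades it for a single cheap false alarm.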

Citations: 54

Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.137

Haimonti Dutta, H. Kargupta

Advances in computing and communication have resulted in very large scale distributed environments in recent years. They are capable of storing large volumes of data and often have multiple compute nodes. However, the inherent heterogeneity of data components, the dynamic nature of distributed systems, the need for information synchronization and data fusion over a network, and security and access control issues make resource management and monitoring a tremendous challenge. In particular, centralized algorithms for management of resources and data may not be sufficient to manage complex distributed systems. In this paper, we present a distributed algorithm for resource and data management which builds on the traditional simplex algorithm used for solving linear optimization problems. Our distributed algorithm is exact, meaning that its results are identical to those obtained when it is run in a centralized setting. We provide extensive analytical results and experiments on simulated data to demonstrate the performance of our algorithm.

Citations: 19

Remarks to Logical Aspects of Measures of Interestingness of Association Rules

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.45

J. Rauch

Relations of logical calculi of association rules to measures of interestingness of association rules are studied. Logical calculi of association rules, 4ft-quantifiers and important classes of association rules are briefly introduced. New 4ft-quantifiers and association rules are defined by applying suitable thresholds to several known measures of interestingness. It is proved that some of the new 4ft-quantifiers constitute rules that belong to known classes of rules. It is shown that new interesting classes of rules can be defined on the basis of additional new 4ft-quantifiers. Some additional results concerning new classes of rules are proved. Open problems are introduced.

Citations: 1

G-REX: A Versatile Framework for Evolutionary Data Mining

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.117

Rikard König, U. Johansson, L. Niklasson

This paper presents G-REX, a versatile data mining framework based on genetic programming. What distinguishes G-REX from other GP frameworks is that it doesn't strive to be a general purpose framework. This allows G-REX to include more functionality specific to data mining, such as preprocessing, evaluation and optimization methods, as well as a multitude of predefined classification and regression models. Examples of predefined models are decision trees, decision lists, k-NN with attribute weights, hybrid kNN-rules, fuzzy-rules and several different regression models. The main strength is, however, the flexibility, making it easy to modify, extend and combine all of the predefined functionality. G-REX is, in addition, available in a special Weka package adding useful evolutionary functionality to the standard data mining tool Weka.

Citations: 30

Semantic Full-Text Search with ESTER: Scalable, Easy, Fast

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.101

H. Bast, Fabian M. Suchanek, Ingmar Weber

We present a demo of ESTER, a search engine that combines the ease of use, speed and scalability of full-text search with the powerful semantic capabilities of ontologies. ESTER supports full-text queries, ontological queries and combinations of these, yet its interface is as easy as can be: a standard search field with semantic information provided interactively as one types. ESTER works by reducing all queries to two basic operations: prefix search and join, which can be implemented very efficiently in terms of both processing time and index space. We demonstrate the capabilities of ESTER on a combination of the English Wikipedia with the Yago ontology, with response times below 100 milliseconds for most queries, and an index size of about 4 GB. The system can be run both stand-alone and as a Web application.
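
The two basic operations named in the abstract can be sketched on a toy inverted index. This is not ESTER's index or API: the data layout, the function names `prefix_search` and `join`, and the reading of "join" as an intersection of sorted document-id lists are all assumptions made for illustration.

```python
from bisect import bisect_left


def prefix_search(vocabulary, postings, prefix):
    """Return the sorted ids of documents containing any word that starts with prefix.

    vocabulary must be sorted, so all words sharing a prefix are contiguous
    and the first of them can be located with a binary search.
    """
    start = bisect_left(vocabulary, prefix)
    doc_ids = set()
    for word in vocabulary[start:]:
        if not word.startswith(prefix):
            break
        doc_ids.update(postings[word])
    return sorted(doc_ids)


def join(ids_a, ids_b):
    """Intersect two sorted id lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(ids_a) and j < len(ids_b):
        if ids_a[i] == ids_b[j]:
            out.append(ids_a[i])
            i += 1
            j += 1
        elif ids_a[i] < ids_b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical toy index: word -> sorted list of document ids.
postings = {
    "sea": [5],
    "search": [1, 2, 4],
    "searching": [3],
    "semantic": [2, 3, 6],
}
vocabulary = sorted(postings)

# Documents matching the prefix "sear" that also match the prefix "sem".
print(join(prefix_search(vocabulary, postings, "sear"),
           prefix_search(vocabulary, postings, "sem")))
```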

Citations: 7

Stream-Close: Fast Mining of Closed Frequent Itemsets in High Speed Data Streams

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.51

Ranganath B. N., M. Murty

With the emergence of large-volume and high-speed streaming data, the recent techniques for stream mining of CFIs (closed frequent itemsets) will become inefficient. When concept drift occurs at a slow rate in high speed data streams, the rate of change of information across different sliding windows will be negligible. So, the user won't be deprived of changes in information if we slide the window by multiple transactions at a time. Therefore, we propose a novel approach for mining CFIs cumulatively by using a sliding width (≥1) over high speed data streams. However, it is nontrivial to mine CFIs cumulatively over a stream, because such growth may lead to the generation of an exponential number of candidates for closure checking. In this study, we develop an efficient algorithm, stream-close, for mining CFIs over a stream by exploring some interesting properties. Our performance study reveals that stream-close achieves good scalability and has promising results.
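
To make the terms concrete, here is a brute-force sketch of what is being mined. It is not the stream-close algorithm, which the abstract does not describe in detail. It enumerates every candidate itemset, which is exactly the exponential blow-up the paper sets out to avoid. An itemset is taken to be closed when no proper superset has the same support. The function names and the toy transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(window, min_support):
    """Map each itemset appearing in at least min_support transactions to its count."""
    items = sorted({item for transaction in window for item in transaction})
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for transaction in window if candidate <= transaction)
            if count >= min_support:
                frequent[candidate] = count
    return frequent


def closed_itemsets(frequent):
    """Keep only itemsets with no proper superset of equal support."""
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(itemset < other and count == other_count
                   for other, other_count in frequent.items())
    }


def slide(window, new_transactions, size):
    """Advance the window by several transactions at once (sliding width >= 1)."""
    return (window + new_transactions)[-size:]


fs = frozenset
window = [fs("ab"), fs("abc"), fs("ab"), fs("c")]
print(closed_itemsets(frequent_itemsets(window, 2)))

# Slide by two transactions at a time and mine the new window.
window = slide(window, [fs("ac"), fs("ac")], 4)
print(closed_itemsets(frequent_itemsets(window, 2)))
```

In the first window `{a}` and `{b}` are frequent but not closed, because `{a, b}` occurs in the same three transactions.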

Citations: 4

Graph-Based Data Mining in Dynamic Networks: Empirical Comparison of Compression-Based and Frequency-Based Subgraph Mining

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.68

C. You, L. Holder, D. Cook

We propose a dynamic graph-based relational mining approach using graph-rewriting rules to learn patterns in networks that structurally change over time. A dynamic graph containing a sequence of graphs over time represents dynamic properties as well as structural properties of the network. Our approach discovers graph-rewriting rules, which describe the structural transformations between two sequential graphs over time, and also learns description rules that generalize over the discovered graph-rewriting rules. The discovered graph-rewriting rules show how networks change over time, and the description rules in the graph-rewriting rules show temporal patterns in the structural changes. We apply our approach to biological networks to understand how the biosystems change over time. Our compression-based discovery of the description rules is compared with the frequent subgraph mining approach using several evaluation metrics.
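
As a rough illustration of a structural transformation between two sequential graphs, the sketch below records only which edges disappear and which appear at each time step. This is a simplification assumed for the example: the paper discovers its rules through subgraph mining and then generalizes them into description rules, none of which is reproduced here. Graphs are represented as sets of edge tuples, and `rewrite_rule` is a hypothetical name.

```python
def rewrite_rule(graph_prev, graph_next):
    """Describe the change between two graphs as removed and added edge sets."""
    return {
        "removed": graph_prev - graph_next,
        "added": graph_next - graph_prev,
    }


def rewrite_rules(dynamic_graph):
    """Compute one rule for every pair of consecutive graphs in the sequence."""
    return [rewrite_rule(prev, nxt)
            for prev, nxt in zip(dynamic_graph, dynamic_graph[1:])]


# Hypothetical dynamic graph: three snapshots of a small network.
dynamic_graph = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "D")},
    {("A", "B"), ("C", "D"), ("B", "C")},
]
for step, rule in enumerate(rewrite_rules(dynamic_graph), start=1):
    print(step, rule)
```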

Citations: 25

The Set Classification Problem and Solution Methods

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.113

Xia Ning, G. Karypis

This paper focuses on developing classification algorithms for problems in which there is a need to predict the class based on multiple observations (examples) of the same phenomenon (class). These problems give rise to a new classification problem, referred to as set classification, that requires the prediction of a set of instances given the prior knowledge that all the instances of the set belong to the same unknown class. This problem falls under the general class of problems whose instances have class label dependencies. Four methods for solving the set classification problem are developed and studied. The first is based on a straightforward extension of the traditional classification paradigm whereas the other three are designed to explicitly take into account the known dependencies among the instances of the unlabeled set during learning or classification. A comprehensive experimental evaluation of the various methods and their underlying parameters shows that some of them lead to significant gains in performance.
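
The abstract does not spell out its four methods. One plausible baseline for the problem it defines, assumed here purely for illustration, is to classify each instance of the set independently with an ordinary classifier and let the set take the most common prediction. The names `classify_set` and `toy_classifier` are hypothetical.

```python
from collections import Counter


def classify_set(instances, classify_one):
    """Assign one label to a whole set by majority vote over per-instance predictions.

    Relies on the prior knowledge that every instance in the set shares
    the same (unknown) class.
    """
    votes = Counter(classify_one(instance) for instance in instances)
    return votes.most_common(1)[0][0]


def toy_classifier(x):
    """Stand-in per-instance classifier: a fixed threshold on a single feature."""
    return "high" if x > 0.5 else "low"


# Five noisy observations of the same phenomenon; two are misclassified
# individually, yet the set-level label is still recovered.
observations = [0.7, 0.4, 0.9, 0.3, 0.8]
print(classify_set(observations, toy_classifier))
```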

Citations: 24

Discovering Implicit Redundancies in Network Communications for Detecting Inconsistent Values

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.15

B. Nassu, T. Nanya, Hiroshi Nakamura

Detecting inconsistent values received in a communication is a challenging problem faced in networked systems. Inconsistent values occur when a message contains incorrect data, even though the syntax is correct and there is no corruption due to transmission errors. In many cases, traditional schemes based on voting protocols or error detection codes cannot be used. An alternative is discovering implicit redundancies, or patterns that model a correct communication, and using these patterns to detect inconsistent values. However, existing techniques do not cover the inputs and sequential patterns needed by this problem. In this paper, we propose a novel technique that considers messages with multiple types and attributes, events involving variables, and a heuristic for reducing redundant information. Experiments show that the discovered redundancies can achieve reasonable error detection coverage in fields where sequential relations exist, without incurring a large number of false alarms or a high latency.

Citations: 0

A Semi-supervised Learning Algorithm for Recognizing Sub-classes

2008 IEEE International Conference on Data Mining Workshops

Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.129

Ranga Raju Vatsavai, S. Shekhar, B. Bhaduri

In many practical situations it is not feasible to collect labeled samples for all available classes in a domain. Especially in supervised classification of remotely sensed images, it is impossible to collect ground truth information over large geographic regions for all thematic classes. As a result, analysts often collect labels for aggregate classes (e.g., Forest, Agriculture, Urban). In this paper we present a novel learning scheme that automatically learns sub-classes (e.g., Hardwood, Conifer) from the user-given aggregate classes. We model each aggregate class as a finite Gaussian mixture instead of the classical assumption of a unimodal Gaussian per class. The number of components in each finite Gaussian mixture is automatically estimated. Semi-supervised learning is then used to recognize sub-classes by utilizing very few labeled samples per sub-class and a large number of unlabeled samples. Experimental results on real remotely sensed image classification showed not only improved accuracy in aggregate class classification but also that the proposed method recognized sub-classes accurately.

Citations: 6
